ABSTRACT


There is a broad spectrum of design styles that have proven successful for the construction
of VLSI circuits and systems. Semi-custom to full-custom design styles offer a wide range of
resulting performance, expected turnaround time, and required design effort. Implementation
alternatives, such as replacing static memory with dynamic memory to implement a denser on-chip
memory, also exist at all levels of the design hierarchy. To make the best use of scarce resources on
a single-chip microprocessor and to make the emerging CAD tools truly useful, alternatives in the
implementation of a microprocessor must be carefully evaluated. The research reported in this
thesis focuses on issues concerning these alternatives, especially in the areas of on-chip memory
design and automated control logic design.

The methodologies and techniques used to maximize the performance of a full-custom
VLSI microprocessor, called the SPUR CPU, are presented first to provide an overview of
microprocessor design strategies. The rest of the research grew out of new ideas
and better alternatives that have become available since the SPUR CPU was designed. These are based on
lessons learned in the SPUR design and on advanced computer-aided design tools such as
multi-level logic synthesis systems. A rigorous evaluation of these alternatives is attempted, and the
results of the evaluation establish the effectiveness of the alternatives considered.

To increase the area efficiency of the on-chip memory, two memory design techniques are
proposed and evaluated. Selective invalidation instead of refreshing, implemented using
low-overhead dynamic CMOS circuits, can effectively eliminate the refreshing requirement of
dynamic memory. With this scheme, the size of an on-chip local memory can be substantially
increased without consuming more of the scarce silicon area. Trace-driven simulations show the
effectiveness of this scheme over a simple invalidation scheme. The demand for high-bandwidth local
memory created by parallel execution of programs through multiple functional units requires a
fast, stable, yet compact multi-port memory cell. A single-ended access, static memory cell
operated at reduced voltage levels is shown to be useful for such applications.

Part of this research is devoted to investigating various layout styles for a microprocessor's
control logic. Recently, various VLSI CAD tools have emerged to facilitate hard-wired control
design. Behavioral synthesis and multi-level logic optimization systems provide particularly
efficient and high-performance hard-wired logic implementations, even with semi-custom layout
styles such as standard cell-based design. All new design methods aim for simplicity and
regularity. The standard cell-based design style, when combined with multi-level logic optimization,
can provide a resulting design as good as a full-custom version but in a much shorter design time.

Introduction


This thesis consists of three self-contained chapters that examine the design and
implementation of a VLSI microprocessor chip (the SPUR CPU), on-chip memory design techniques,
and alternative implementations of a microprocessor's control logic. Chapters 2, 3 and 4 are
stand-alone presentations. In this introductory chapter, I provide an overview of the research
and the thesis organization.

1.1.  Motivation 

There is a broad spectrum of design styles (e.g. semi-custom or full-custom design
styles, and static or dynamic memory for on-chip local memory design) that have proven
successful for the construction of VLSI circuits and systems. For all these styles and for all the
abstraction levels (behavior, logic, circuits, and layout) in the design hierarchy, good
computer-aided design (CAD) tools are indispensable. To make the emerging CAD tools truly
useful and to take advantage of advanced VLSI technology, all of the many different design
styles must be carefully examined, with choices made judiciously for a particular application.

The research reported in this thesis was originally motivated by the design and
implementation of the SPUR CPU chip. After presenting the details of its design and
implementation, I examine several alternatives for a full-custom VLSI microprocessor design.
As a result, a detailed evaluation of these alternatives will be available for future microprocessor
development.

The  objectives  of  the  research  are  as  follows: 

(1)  Develop  a  better  understanding  of  implementation  alternatives  in  VLSI  microprocessor 
design. 

(2)  Examine  alternatives  rigorously,  in  order  to  be  able  to  compare  and  evaluate  them 
quantitatively. 

(3)  Provide  ideas  or  guidelines  for  future  microprocessor  design  and  for  the  development  of 
computer-aided  design  tools. 

1.2.  VLSI  Microprocessor  Design  and  Implementation 

Over the last decade, many commercial and research microprocessors have been
successfully built on a single chip. The design steps in the development of a
microprocessor are summarized below. Among these steps, this research focuses
on the VLSI design issues, particularly on alternatives for a full-custom implementation of
the VLSI microprocessor.

Architecture  definition 

The high-level architecture and the instruction set are defined in this step. Global design issues
such as language support, operating system support, memory management support and
coprocessor support are considered and defined.

Technology selection

The performance of the microprocessor strongly depends on the technology in which
the microprocessor is implemented. Emitter-coupled logic (ECL), CMOS, and NMOS
technologies are among the most popular choices. The selection also strongly relies on
the design environment supported by CAD tools, since some CAD tools may only work
with a particular technology. The design style associated with each technology is also
considered here. The selection of design style greatly influences design cost and
turnaround time. Semicustom design, like the gate array design style, may require a few
weeks to get the first working silicon, while highly optimized full-custom design may
take several years. Other semicustom design styles include sea of gates (channel-less
gate array), standard cell, and macro cell design styles.

Microarchitecture  design 

An important task in this step is to specify a detailed behavioral description of the
architecture. Since it represents a complete design of the microprocessor for a selected
technology, it is used to verify the architecture defined at a higher level as well as to
provide diagnostic vectors for later stages of design verification and debugging.

VLSI  design  and  implementation 

The  chip  implementing  the  microarchitecture  is  designed  in  this  step.  First,  the 
behavioral  description  is  synthesized  into  several  different  modules,  such  as  control 
logic,  data  path,  and  memory.  Since  implementation  styles  for  each  of  these  modules 
are  very  different,  different  implementation  strategies  are  used.  Design  methodology  for 
each  of  these  modules  is  chosen  so  that  the  best  overall  performance  can  be  achieved. 

Design  verification 

Once schematics or a layout representation of the microarchitecture is made, this
representation is verified against the behavioral description to make sure that the two
representations are equivalent. Usually a net list is generated from the schematic
diagram or mask layout, logic simulations are performed on this extracted net
list, and the results are verified against the behavioral simulation. Design and electrical rules
are also checked in this step.

Integrated  circuits  fabrication 

Masks  are  made  from  the  layout  and  the  design  is  transformed  into  the  silicon. 

Testing 

Testing verifies functional behavior, electrical performance and fabrication processes.
Test vectors are generated to cover as much of the chip area, and as many
functions, as possible.

1.3.  A  VLSI  Design  Methodology  for  a  Microprocessor 

Optimizing  performance  in  VLSI  digital  systems  involves  several  design  choices, 
including  the  choice  of  the  best  implementation  methodology.  Alternatives  exist  at  all  levels 
of  design  and  each  must  be  carefully  examined  to  obtain  an  optimal  implementation  for  a 
given architecture and technology. Because some of these processes are very
time-consuming, designers rely on structured methods aided by a computer.

Many design strategies are used to deal with complexities in VLSI design
[MeC80][WeA85]. One of the more frequently used strategies is to divide the design into
several parts so that each can be implemented using the most efficient method. This
results in an implementation that is optimal part by part. A balanced optimization is also important
in such a strategy, since overall performance is usually determined by the most critical part.
Various tradeoffs among different parts, such as area versus timing, can be made to achieve
the best overall performance.

In general, microprocessor implementation can be divided into three different activities:
data path design, control logic design, and on-chip local memory design. These parts differ
from one another and require different methodologies as well as different design styles. Alternatives
and optimization techniques for the data path part of the microprocessor have been studied
extensively in many past research projects. On-chip memory design has become an important
issue since the I/O communication bottleneck of a single-chip processor can be significantly
relieved by having an on-chip local memory. As more chip area is devoted to the local
memory (some microprocessors have more memory than other logic, e.g. the TI Lisp chip
[Bos87]) and many different memory organizations emerge, on-chip memory design requires
consideration separate from data path design. The research presented in this
thesis addresses two separate issues: the on-chip memory design and the control logic
design parts of a full-custom VLSI microprocessor.

1.4.  Related  Work 

The research presented in Chapter 2 on the SPUR CPU chip is not a one-person project.
Several graduate students have worked on various aspects of the research. The instruction set
architecture of the SPUR CPU was defined by George S. Taylor [THL86], and the
microarchitecture design was refined by Shing I. Kong [Kon89]. Mark D. Hill contributed to the
design of the on-chip instruction cache [Hil87]. Most of the work presented in Chapter 2 of this
thesis is in the area in which I participated most: VLSI chip design and implementation.

Many papers have discussed on-chip memory design from both architecture and
implementation perspectives; these include [Goo83], [HiS84], [ACH87], [EiP88], [GoH86], and
[Kad82]. Most concentrate on architectural design issues such as register versus cache and
the organization of on-chip caches. Agarwal et al. present the importance of the tradeoffs
between on-chip cache architecture and implementation [ACH87]. They show that for
on-chip caches, considerations other than hit rate are important. These include the total usable
area, the timing of cache accesses, the physical organization of the cache, and the aspect ratio
of the resulting design.

Two recent papers discuss using dynamic memory for on-chip local memories
[Tra85][Bos87]. Tran presents a successful integration of high-density 1T DRAM in a digital
signal processor chip [Tra85]. Bosshart et al. also present a memory-intensive
microprocessor chip built for LISP processing [Bos87]. Over eighty percent of this chip is
used to implement memories, including RAMs made of 4T dynamic cells. The DRAMs
refresh when they are not required for other operations. A master refresh timer is also
provided to enforce extra refresh cycles in case any entry has not been refreshed. Several
multi-port memory cells for on-chip memories have been proposed in [She84], [Kad82],
[O'C87], [DiS79], and [Nak88]. The analysis of these cells is further carried out in [SLL87],
[O'C87], and [Nak88].

The efficiency of two general approaches (microprogrammed and hard-wired) for
designing the control unit of a VLSI microprocessor is examined in [Anc83]. By
re-implementing the control unit of the MC6800, this research shows that the hard-wired approach
always gives minimum area but that its design cost increases too rapidly with increasing
complexity. The aim of reducing the design cost may lead designers to choose design styles that are less
optimal in terms of silicon area but use a more regular structure.

For alternative control design using hard-wired logic, Hoffman compares
multi-level logic implemented in array structured logic (a CMOS extension of the Weinberger array
using domino logic) with a two-level PLA implementation [Hof85]. The control logic of the
CMOS SOAR (SmallTalk on A RISC [Pen87]) microprocessor chip is used as the basis of
comparison. He shows that multi-level logic implemented in array form is faster and
smaller than a PLA version of the same logic. The impact of library size on the quality of
automated logic synthesis is investigated in [Keu87]. It concludes that an incrementally
larger library size can considerably reduce area while meeting comparable timing
requirements.

1.5.  Thesis  Organization

The main body of the thesis consists of three chapters (Chapters 2, 3 and 4), which
are written as stand-alone discussions rather than as tightly integrated parts of a whole. For
this reason, the chapters can be read separately, in any order.

Chapter 2, VLSI Implementation of the SPUR CPU Chip, presents the design of a VLSI
microprocessor chip for a multiprocessor workstation called SPUR. The central processing
unit (CPU) of the SPUR processor supports a multilevel cache scheme that includes a
prefetching on-chip instruction cache, a coprocessor interface, and support for fast execution
of LISP through a tagged 40-bit architecture [Hil86]. In addition to the implementation
details of each part, the chapter presents an overall methodology. In order to build a
working computer system based on the SPUR CPU chip, a reliable and efficient methodology was
indispensable.

The research presented in the next two chapters (Chapters 3 and 4) is developed from the
implementation of the SPUR CPU. New ideas and better alternatives have become available
since the SPUR CPU, from the lessons learned in the SPUR design and from newly developed
CAD tools. These must be examined rigorously to be useful for improving the performance
of next-generation microprocessors. New and better alternatives in VLSI microprocessor
design are presented using examples from the SPUR CPU chip, and comparisons are made to
determine the effectiveness of each alternative.

Chapter 3, On-chip Memory Design, presents new techniques for on-chip local memory
design. Simple circuit design techniques, when properly adapted to the architectural design,
can provide a cost-effective performance improvement. A selective invalidation
scheme implemented with low-overhead circuits can eliminate the refreshing requirement of
dynamic memory when the memory is used as a read-only or write-through cache. Using this scheme, a static
memory can be replaced with a high-density dynamic memory without degrading performance or
reliability. Parallel execution of programs using more than one functional unit is an
effective approach to increasing processor performance [PlS88]. The bandwidth required
by multiple functional units demands a fast, high-bandwidth memory with multiple ports. A
single-ended static memory cell operated at reduced voltage levels can be as safe and fast as
other multi-port memory cells while consuming much less area. Circuit simulations and static
noise margin analysis [SLL87] show that it is feasible to implement a multi-port
memory based on this cell.
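
The idea of invalidating instead of refreshing can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit proposed in Chapter 3: it assumes that a dynamic cell holds trustworthy data for a fixed number of cycles after it was last written, that time is counted in cycles, and that the class and method names (`SelectiveInvalidationCache`, `write`, `read_hit`) are hypothetical. Because the cache is read-only or write-through, an invalidated entry can always be fetched again from the next memory level.

```python
from typing import Dict, Tuple


class SelectiveInvalidationCache:
    """Toy model of a dynamic-cell cache that invalidates instead of refreshing.

    Assumption: an entry is trustworthy only for `retention_cycles` after
    its last write. Expired entries are dropped one at a time, so entries
    written more recently stay valid.
    """

    def __init__(self, retention_cycles: int) -> None:
        self.retention_cycles = retention_cycles
        # index -> (tag, cycle at which the entry was written)
        self.lines: Dict[int, Tuple[int, int]] = {}

    def write(self, index: int, tag: int, now: int) -> None:
        """Fill or overwrite one entry; this restarts its retention interval."""
        self.lines[index] = (tag, now)

    def read_hit(self, index: int, tag: int, now: int) -> bool:
        """Return True on a hit; invalidate the entry if its data has aged out."""
        entry = self.lines.get(index)
        if entry is None:
            return False
        stored_tag, written_at = entry
        if now - written_at >= self.retention_cycles:
            del self.lines[index]  # invalidate only this expired entry
            return False
        return stored_tag == tag
```

In this model an expired entry simply turns into a miss, which is safe only because a valid copy of the data exists outside the cache.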

Chapter 4, Control Design Alternatives, discusses the alternatives available for
designing the control portion of the microprocessor. A common approach to regularizing the design
of random control logic employs structured logic elements, such as PLAs and microcode
ROMs, to implement the microprocessor's control. Emerging CAD tools, especially in
multi-level logic synthesis and optimization [NeS86][Bra87], now allow the combinational
portion of control logic to be mapped into different design styles, such as standard cell-based
design, which have not been well utilized in full-custom VLSI designs. In this chapter, I
examine these alternative design styles by re-implementing the control units from the SPUR
chips and contrasting them with the full-custom versions (designed with only PLA synthesis tools)
that are also available from the SPUR designs.

Finally, Chapter 5 concludes the thesis and provides a summary of the research and
future work.

VLSI Implementation of the SPUR CPU Chip


2.1.  Introduction 

SPUR*  (Symbolic  Processing  Using  RISCs)  is  a  multiprocessor  workstation  developed 
at  the  University  of  California  at  Berkeley  as  a  testbed  for  research  on  parallel  processing, 
particularly  in  LISP  [Hil86].  A  SPUR  workstation,  shown  in  Figure  2-1,  can  have  6  to  12 
identical  processors,  each  of  which  consists  of  a  128K-byte  cache,  a  CPU,  a  floating  point 
coprocessor,  and  a  cache  control  and  memory  management  unit  (CMU)  that  assures  the  cache 
coherency  among  multiple  processors.  The  picture  of  a  fully  populated  SPUR  processor 
board  is  shown  in  Figure  2-2.  This  chapter  describes  the  VLSI  implementation  of  the  CPU 
chip,  a  32-bit  RISC  microprocessor. 

The SPUR CPU supports a multilevel cache scheme that includes a prefetching on-chip
instruction cache, a coprocessor interface, and support for fast execution of LISP through a
tagged 40-bit architecture. The coprocessor interface supports concurrent CPU and FPU
operations. It uses 27 pins to implement a low-overhead interface between the CPU and the
FPU.

*SPUR is sponsored by DARPA under contract order 482427-25840, California MICRO, Texas Instruments, National
Semiconductor, Cypress Semiconductor, Tektronix, and HP.

Figure 2-1. Multiprocessor workstation

Figure 2-2. A SPUR processor board

The chip, implemented in a 1.6 μm, double-metal CMOS technology, contains 115K
transistors. The chip statistics are summarized in Table 2-1, and a chip photomicrograph is
shown in Figure 2-3. An on-chip clock generator, based on a charge-pump phase-locked loop
with a tapped delay line, provides an accurate phase relationship with the board clock and also
with the clock phases of the other chips [Jeo87]. The nominal operating frequency with a 4-phase
non-overlapping clock (18 nsec nominal per phase and 7 nsec non-overlap time between
phases) is 10 MHz (12.5 MHz maximum). According to simulation, a SPUR uniprocessor running
LISP programs (Gabriel benchmarks) at 10 MHz provides, on average, a 2X performance
improvement over the Symbolics 3600 or VAX 8650 [THL86]. A SPUR workstation with
6 to 12 processors is predicted to yield a sustained throughput of 40 to 70 MIPS, respectively.
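
The clock figures quoted above are mutually consistent, which a few lines of arithmetic confirm. The sketch below assumes that one clock cycle is made up of four phases, each followed by one non-overlap gap; the function names are illustrative only.

```python
def clock_period_ns(phases: int, phase_ns: float, gap_ns: float) -> float:
    """Cycle time of a multi-phase non-overlapping clock.

    Assumption: every phase is followed by exactly one non-overlap gap.
    """
    return phases * (phase_ns + gap_ns)


def frequency_mhz(period_ns: float) -> float:
    """Convert a period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


nominal_period = clock_period_ns(phases=4, phase_ns=18, gap_ns=7)  # 100 ns
nominal_mhz = frequency_mhz(nominal_period)                        # 10 MHz
max_period = 1000.0 / 12.5                                         # 80 ns at 12.5 MHz
```

At the 12.5 MHz maximum, the 80 ns cycle leaves 20 ns for each phase and its gap, compared with 25 ns at the nominal rate.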


Number of Transistors     115,214
Number of PLAs            13
Die Size                  11.5 mm x 11.5 mm
Package                   208-pin PGA (40 pins for power supply)
Process                   Double-Metal 1.6 μm N-Well CMOS
Operating Frequency       10 MHz
Power Dissipation         0.8 W at 10 MHz with 5 V Supply

Table 2-1. Chip Statistics

Figure 2-3. Chip microphotograph

The organization of the chapter is as follows: Section II gives an overview of the CPU
architecture and execution pipeline. Section III focuses on the hardware required to
implement various features of the SPUR CPU architecture. Section IV describes the design,
verification, and testing methodologies of the full-custom SPUR CPU chip. Finally, the
summary and conclusion are given in Section V.

2.2.  An  Overview  of  the  SPUR  CPU  Architecture 

The SPUR CPU is a third-generation RISC microprocessor developed at the University
of California at Berkeley. It is specifically designed to be used in the SPUR multiprocessor
workstation. The architecture of the SPUR CPU is akin to those of previous RISC projects at
U.C. Berkeley [Kat83], [Pen87]. Some new features, however, have been added: a
coprocessor interface to support floating-point computation, an efficient interface to the cache-control
and memory-management unit, and run-time hardware tag checking for fast execution of
LISP programs. The instruction set of the SPUR CPU is carefully chosen so that an
efficient implementation of single-cycle execution of all instructions is possible.

Like previous RISC processors, the SPUR CPU is a load-store machine. Memory is
accessed only through load and store instructions. All other instructions are
register-to-register or immediate-to-register oriented. There are four generic instruction types:
register-to-register, store, compare-and-branch, and call-jump. Load and return instructions are
special cases of register-to-register in which (Rs1 + Rs2) or (Rs1 + Immediate) is used as an
effective address. The Rd field specifies the register to be loaded for the load instruction type and
is not used for the return instruction type. All instructions (40 integer and 20 floating point)
are 32 bits wide and use fixed formats. The seven instruction formats are shown in Figure 2-4.
The opcode and the register specifiers are in the same positions in all formats. The
three-register format (RRR) is used for loads, register-to-register operations, special register
operations and coprocessor operations. The two-register and one-immediate format (RRI) is used for loads
and register-to-register operations. Compare-and-branch instructions have three slightly
different formats depending on the field specifying the condition.
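
Because the opcode and register specifiers sit in the same positions in every format, field extraction reduces to fixed shifts and masks. The sketch below is illustrative only. The bit boundaries are an assumption read from the tick marks in Figure 2-4 (opcode in bits 31-25, Rd in 24-20, Rs1 in 19-15, a register/immediate flag in bit 14, Rs2 in 13-9); the helper names and the sample opcode value are hypothetical.

```python
from typing import NamedTuple


class RegisterFields(NamedTuple):
    opcode: int
    rd: int
    rs1: int
    imm_flag: int  # assumed: 0 = register operand, 1 = immediate operand
    rs2: int


def bits(word: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo (inclusive) from a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_rrr(word: int) -> RegisterFields:
    """Decode the three-register format using the assumed bit boundaries."""
    return RegisterFields(
        opcode=bits(word, 31, 25),
        rd=bits(word, 24, 20),
        rs1=bits(word, 19, 15),
        imm_flag=bits(word, 14, 14),
        rs2=bits(word, 13, 9),
    )


def encode_rrr(opcode: int, rd: int, rs1: int, rs2: int) -> int:
    """Build a three-register word; the flag bit and unused bits are zero."""
    return (opcode << 25) | (rd << 20) | (rs1 << 15) | (rs2 << 9)


# Round trip with an arbitrary (hypothetical) opcode value.
word = encode_rrr(opcode=0x2A, rd=3, rs1=17, rs2=9)
fields = decode_rrr(word)
```

Five-bit register specifiers match the 32 registers visible from one window, which is one reason these boundaries are a plausible reading of the figure.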


Register-Register: Rd, Rs1, Rs2
    | opcode | Rd | Rs1 | 0 | Rs2 | unused |
    bit labels: 31, 24, 19, 14, 8, 0

Register-Register: Rd, Rs1, Immediate
    | opcode | Rd | Rs1 | 1 | Immediate |
    bit labels: 31, 24, 19, 14, 0

Store: Rs2, Rs1, Immediate
    | opcode | High Imm | Rs1 | 1 | Rs2 | Low Imm |
    bit labels: 31, 24, 19, 14

Compare-Branch: Rs1, Rs2
    | opcode | Cond | Rs1 | 0 | Rs2 | Branch Offset |
    bit labels: 31, 24, 19, 14, 8, 0

Compare-Branch: Rs1, Short Imm
    | opcode | Cond | Rs1 | 1 | Short Imm | Branch Offset |
    bit labels: 31, 24, 19, 14, 8, 0

Compare-Branch: Rs1, Tag Imm
    | opcode | Cond | Rs1 | Tag Imm | Branch Offset |
    bit labels: 31, 24, 19, 14, 8, 0

Call, Jump: Word Address
    | opcode | Word address within current segment |
    bit labels: 31, 27, 0

Figure 2-4. SPUR instruction formats

The CPU registers are organized in eight overlapped windows (128 registers) and 10
global registers accessible from any window (a total of 32 registers visible from one window).
The overlapped window scheme considerably reduces the register save and restore overheads
between procedure calls. The registers are 40-bit registers with 32 bits for data and an 8-bit
tag used for runtime type checking and garbage collection. The 8-bit tag consists of a 6-bit
object type tag and a 2-bit generation number. LISP is supported with three types of
hardware tag checking with traps to a software trap handler: data type checking for general
computations, pointer type checking for list operations, and generation number checking for
garbage collection based on the generation scavenging algorithm [Ung84].
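
The register counts above determine how much adjacent windows must overlap. The following sketch works through that arithmetic and then shows one possible mapping from a window-relative register number to a physical register. It assumes that adjacent windows overlap symmetrically; the mapping itself (globals in logical registers 0 to 9, windows advancing by increasing physical index) is a hypothetical numbering chosen for illustration, not the SPUR register file layout.

```python
WINDOWS = 8
WINDOW_REGISTERS = 128   # registers in the overlapped windows
GLOBALS = 10
VISIBLE = 32             # registers visible from one window

PER_WINDOW = WINDOW_REGISTERS // WINDOWS  # 16 new registers per window
WINDOWED_VISIBLE = VISIBLE - GLOBALS      # 22 windowed registers in view
OVERLAP = WINDOWED_VISIBLE - PER_WINDOW   # 6 shared with each neighbor
LOCAL = PER_WINDOW - OVERLAP              # 10 private to the window


def physical_register(window: int, logical: int) -> int:
    """Map (window, logical register 0..31) to a physical index 0..137.

    Hypothetical layout: logical 0..9 are globals; logical 10..31 index
    into a circular file of 128 windowed registers.
    """
    if not 0 <= logical < VISIBLE:
        raise ValueError("logical register out of range")
    if logical < GLOBALS:
        return logical
    offset = (window * PER_WINDOW + (logical - GLOBALS)) % WINDOW_REGISTERS
    return GLOBALS + offset
```

Under this layout the last six windowed registers of one window are the first six of the next, so arguments can be passed across a procedure call without copying.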

The on-chip instruction cache provides the effect of an extra memory port, allowing
simultaneous data memory reference and instruction fetch by the execution unit (EU). This
leads to a four-stage pipeline (Figure 2-5) that eliminates the need for pipeline stalling
whenever a load instruction is executed. Consequently, the CPU can issue and complete one
instruction per cycle (a peak performance rate of 10 MIPS per processor) as long as there are
no instruction or external data cache misses. Branch conflicts in the pipeline are resolved by a
single-cycle delayed branch with one instruction in the delay slot. Data conflicts are
resolved by hardware internal forwarding logic.
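
The relation between the one-instruction-per-cycle issue rate and the 10 MIPS peak can be made concrete with a simple cycle count. This is a rough model under stated assumptions: a four-stage pipeline that fills once, a 10 MHz clock, and all cache-miss penalties lumped into a single hypothetical `stall_cycles` parameter.

```python
def pipeline_cycles(n_instructions: int, stages: int = 4, stall_cycles: int = 0) -> int:
    """Cycles to complete n instructions when one instruction issues per cycle."""
    if n_instructions == 0:
        return 0
    return (stages - 1) + n_instructions + stall_cycles


def sustained_mips(n_instructions: int, clock_mhz: float,
                   stages: int = 4, stall_cycles: int = 0) -> float:
    """Average instruction rate in MIPS for a run of n_instructions (n > 0)."""
    cycles = pipeline_cycles(n_instructions, stages, stall_cycles)
    return n_instructions * clock_mhz / cycles


no_miss_rate = sustained_mips(1000, clock_mhz=10.0)                      # just under 10
half_rate = sustained_mips(1000, clock_mhz=10.0, stall_cycles=997)       # 5.0
```

The pipeline fill cost is amortized over long instruction runs, so the sustained rate is dominated by stall cycles rather than by pipeline depth.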

In order to facilitate high-precision floating-point computations and other possible
coprocessing capabilities, the SPUR CPU incorporates a parallel interface to coprocessors.
The floating-point coprocessor interface implemented in the current version of the CPU chip
supports concurrent CPU and FPU operations. It uses 27 pins to implement a low-overhead
interface between the CPU and the FPU. The FPU tracks CPU instructions issued by the
instruction cache in the CPU via 22 pins carrying the opcode and register specifiers. The CPU
sends 2 control signals to the FPU, and the 3-bit FPU status is sent to the CPU. The CPU
treats all FPU instructions as illegal instructions when the FPU is disabled. When the FPU is
enabled, all FPU instructions except FPU load and store are treated by the CPU as NO_OP.
For FPU load and store, the CPU computes the effective memory address and the FPU reads
and writes the data directly from the external cache.

Figure 2-5. SPUR CPU pipeline
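
The CPU's treatment of FPU instructions described above amounts to a small decision table, and the pin budget is a simple sum. The sketch below restates both; the enum and function names are hypothetical and the three boolean inputs are an illustrative simplification of the decode logic.

```python
from enum import Enum


class CpuAction(Enum):
    EXECUTE = "execute normally"
    ILLEGAL_INSTRUCTION = "treat as illegal instruction"
    NO_OP = "treat as NO_OP"
    COMPUTE_ADDRESS = "compute effective address only"


# Pin budget of the coprocessor interface, as stated in the text.
OPCODE_AND_REGISTER_PINS = 22
CPU_CONTROL_PINS = 2
FPU_STATUS_PINS = 3
TOTAL_PINS = OPCODE_AND_REGISTER_PINS + CPU_CONTROL_PINS + FPU_STATUS_PINS


def cpu_action(is_fpu_instruction: bool, is_fpu_load_store: bool,
               fpu_enabled: bool) -> CpuAction:
    """How the CPU treats an instruction, following the rules in the text."""
    if not is_fpu_instruction:
        return CpuAction.EXECUTE
    if not fpu_enabled:
        return CpuAction.ILLEGAL_INSTRUCTION
    if is_fpu_load_store:
        # The FPU moves the data itself; the CPU only supplies the address.
        return CpuAction.COMPUTE_ADDRESS
    return CpuAction.NO_OP
```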

In  the  SPUR  instruction  set,  a  number  of  special  load  (7)  and  store  (3)  instructions  are 
dedicated  to  cache  control  and  virtual  memory  management.  Although  these  instructions  look 
almost  identical  to  the  CPU,  appropriate  cache  operations  are  provided  to  the  external  CMU 
through  the  CMU  interface.  The  interface  consists  of  a  4-bit  cache-opcode,  two  bits  indicat¬ 
ing  the  mode  of  operations  (user  vs.  kernel  and  physical  vs.  virtual),  and  9  other  status  bits  of 
both  the  CPU  and  the  CMU. 

The unusual conditions that the CPU may face at runtime can be divided into four
groups. Unusual conditions detected inside the CPU are called CPU exceptions: integer
overflow, tag checking, window overflow and underflow, and so on. Unusual conditions
caused by the FPU are called floating-point exceptions. All other unusual conditions
occurring outside the CPU are called faults and interrupts. Faults occur in response to the
execution of an instruction, while interrupts are asynchronous events that come from outside the
processor (e.g. an I/O interrupt). The CPU responds to exceptions, faults, and interrupts by
taking a vectored trap. The trap vector consists of a trap base address concatenated with the
trap type field. There is a priority ordering for cases when more than one unusual condition
occurs at the same time. All traps are taken during an instruction's third pipeline stage, and
hence only one instruction can cause a trap in any cycle. Traps can be disabled or enabled
selectively by controlling the 8 bits in both kernel and user processor status words (KPSW
and UPSW).
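
Vectored trap dispatch with a priority ordering can be sketched in a few lines. Everything specific in this example is hypothetical: the width of the trap type field, the type codes, the priority order among the four groups, and the simplification that every group can be masked. Only the structure (pick the highest-priority enabled condition, then concatenate the base address with the trap type) follows the text.

```python
from typing import Iterable, Optional

TRAP_TYPE_BITS = 4  # hypothetical width of the trap type field

# Hypothetical priority order, highest first, and hypothetical type codes.
PRIORITY = ("fault", "interrupt", "fpu_exception", "cpu_exception")
TRAP_TYPE = {"fault": 0x1, "interrupt": 0x2, "fpu_exception": 0x3, "cpu_exception": 0x4}


def select_trap(pending: Iterable[str], enabled: Iterable[str]) -> Optional[str]:
    """Return the highest-priority condition that is both pending and enabled."""
    pending_set, enabled_set = set(pending), set(enabled)
    for kind in PRIORITY:
        if kind in pending_set and kind in enabled_set:
            return kind
    return None


def trap_vector(base: int, kind: str) -> int:
    """Concatenate the trap base address with the trap type field."""
    return (base << TRAP_TYPE_BITS) | TRAP_TYPE[kind]
```

Concatenation rather than addition means the handler address is formed without an adder, at the cost of aligning the trap table to the size of the type field.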

2.3.  Hardware  Implementation  of  the  SPUR  CPU 

The major functional blocks are shown in Figure 2-6 and outlined in the chip
photomicrograph (Figure 2-3). The major blocks are the execution unit (EU) and the instruction unit
(IU). The EU is further divided into the upper data path, the lower data path, and the control.
The 30-bit upper data path contains pipelined program counters and special registers. It is
used for instruction address calculations and special register references. The 40-bit lower data
path is for general computation on the tagged registers.

2.3.1.  The  instruction  unit 

The SPUR IU consists of a 512-byte (128 instructions) direct-mapped instruction cache,
organized as 16 blocks with 8 subblocks (8 instructions) per block. A novel feature of the
SPUR instruction cache is a valid bit associated with each instruction word in the cache, so
that any subset of instructions within a block may be valid. The SPUR IU uses this flexibility
to reduce demand miss time by loading only the fetched instruction rather than the entire
block, and to permit instruction prefetching to load the rest of a block in parallel with
subsequent instruction fetches [Goo83], [HiS84]. If subsequent prefetches are successful, the
miss penalty is just two cycles for the entire block containing the missed instruction.
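
A minimal behavioral sketch of a direct-mapped cache with one valid bit per instruction word is given below. The block and word counts come from the text; the use of word addresses, the address split, and the class and method names are assumptions made for illustration and do not describe the actual SPUR circuit.

```python
BLOCKS = 16            # blocks in the cache (from the text)
WORDS_PER_BLOCK = 8    # instructions per block (from the text)


class SubBlockCache:
    """Direct-mapped cache in which every word has its own valid bit."""

    def __init__(self):
        self.tags = [None] * BLOCKS
        self.valid = [[False] * WORDS_PER_BLOCK for _ in range(BLOCKS)]
        self.data = [[0] * WORDS_PER_BLOCK for _ in range(BLOCKS)]

    @staticmethod
    def split(word_addr):
        # Assumed address split: low bits select the word, next bits the block.
        word = word_addr % WORDS_PER_BLOCK
        block = (word_addr // WORDS_PER_BLOCK) % BLOCKS
        tag = word_addr // (WORDS_PER_BLOCK * BLOCKS)
        return tag, block, word

    def lookup(self, word_addr):
        """Return the cached instruction, or None on a miss."""
        tag, block, word = self.split(word_addr)
        if self.tags[block] == tag and self.valid[block][word]:
            return self.data[block][word]
        return None

    def fill(self, word_addr, instruction):
        """Load a single word; a tag change invalidates the whole block."""
        tag, block, word = self.split(word_addr)
        if self.tags[block] != tag:
            self.tags[block] = tag
            self.valid[block] = [False] * WORDS_PER_BLOCK
        self.data[block][word] = instruction
        self.valid[block][word] = True
```

A demand miss calls `fill` for the missed word only; a prefetcher would call `fill` for the remaining words of the same block as they arrive.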

The IU can operate in three different modes: (1) disabled, (2) enabled-without-prefetching,
and (3) enabled-with-prefetching, controlled by two bits in the kernel processor status word
(KPSW). In disabled mode, the IU fetches every instruction requested by the EU from the
external cache. Disabled mode is useful for initial chip testing and for allowing chips with
stuck-at-type errors in the cache or tag array to function correctly, albeit more slowly. In
enabled-without-prefetching mode, the IU will cache instructions upon demand misses but
will not initiate any prefetches.

Figure 2-6. SPUR CPU block diagram

The normal mode is enabled-with-prefetching. After the missed instruction is cached,
prefetches are made to subsequent words within the block until another demand miss occurs
or prefetch is blocked by the EU’s external data access. These prefetches are free, as they
never interfere with external cache accesses by the EU, such as instruction fetch and external
data reference, because prefetch has the lowest priority. If a prefetch causes an external cache
miss, the cache controller simply ignores the request.
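
The priority rule for external cache requests can be illustrated with a small arbiter sketch. The text only states that prefetch has the lowest priority and that a prefetch which misses in the external cache is dropped; the relative order of data access and demand instruction fetch, and all names below, are assumptions.

```python
# Highest priority first; only "prefetch is last" is taken from the text.
PRIORITY = ("data_access", "instruction_fetch", "prefetch")


def grant(requests):
    """Return the request that wins the external cache port this cycle."""
    for kind in PRIORITY:
        if kind in requests:
            return kind
    return None


def prefetch_result(external_hit, word):
    """A prefetch that misses in the external cache is simply dropped."""
    return word if external_hit else None
```

Because a prefetch is granted only when nothing else is requesting the port, it never delays an access made by the EU.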

The instruction unit is controlled by two finite-state machines: one controls the fetching
and the other controls the prefetching of instructions. The two finite-state machines and
other random control logic are partitioned into 6 PLAs, taking the timing constraints into
account. Both the IU and the register file use the same 6T SRAM memory cell [She84]. The
data portion of the cache is an array of 128 33-bit words. The tags are stored in a separate
array (16 24-bit words) whose access time is significantly less than that of the data array.
This allows the tag comparison to be done while the instruction is being read out from the
data array. Bitwise comparison using an XOR gate is used for tag comparison and is followed
by dynamic logic to determine a hit. The effective access time of the instruction cache,
including hit logic, is under 12 nsec without using a sense amplifier.
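
The hit determination described above (bitwise XOR followed by a gate that detects whether any bit differs) can be modeled as follows. This is a behavioral sketch only; the function name is hypothetical, and treating all 24 bits of a tag word as address tag is an assumption.

```python
def tag_hit(stored_tag, address_tag, word_valid, width=24):
    """Return True when the stored tag matches and the word is valid."""
    mask = (1 << width) - 1
    # Bitwise XOR gives a 1 in every position where the two tags differ.
    difference = (stored_tag ^ address_tag) & mask
    # The dynamic logic acts as a wide NOR: a hit requires no differing bit.
    return word_valid and difference == 0
```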

2.3.2.  The  execution  unit 

Key features of the execution unit are a register file with eight overlapped windows,
double internal forwarding for resolving register access conflicts, and run-time tag checking
with traps to software on mismatch. The SPUR CPU has a 30-bit branch address adder in the
upper data path which, together with the 32-bit ALU in the lower data path, supports
one-cycle compare-and-branch instructions. Rather than a complex barrel shifter, a
combination of a byte extractor/inserter and a simple shifter is implemented.

A. The register file

The SPUR CPU has a total of 138 general-purpose registers organized in 8 overlapped
windows and 10 global registers. Thirty-two registers are visible to the compiler at any one
time: 10 globals, 10 locals, 6 overlapped with the caller window, and 6 overlapped with the
callee window. Each register is 40 bits wide, with a 6-bit tag, 2 bits for the generation
number, and 32 bits for data. The same 6T SRAM cell used in the IU is used in the register
file. The layout of the SRAM cell is constrained by the pitches of the data path bit slice and
the register decoders (two decoders per register). The result is a large but fast SRAM cell that
does not require a sense amplifier.

The SPUR CPU architecture is register oriented and requires two reads and one write
per cycle. The register access is time multiplexed for the separate reads and the write and is
pipelined to minimize the critical path. Bit lines are decoded and precharged in the same
phase, and the register array is accessed in the following phase by driving the wordline. The
access time of the register file read is the critical path of the chip. It is measured to be under
14 nsec. For registers in the overlapped window, a special decoder shown in Figure 2-7 is
used to map two different register addresses (one from the caller’s window and the other from
the callee’s window) to one register [Kat83].
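
The register counts above are consistent with 10 globals plus 8 windows of 16 unique registers each (10 locals and 6 overlapped), since 10 + 8 x 16 = 138. The sketch below shows one way two visible register numbers can map to the same physical register. The visible numbering (r0-r9 globals, r10-r15 shared with the callee, r16-r25 locals, r26-r31 shared with the caller) and the direction in which the window pointer moves on a call are assumptions for illustration, not the documented SPUR assignment.

```python
WINDOWS = 8
GLOBALS = 10
PER_WINDOW = 16   # 6 overlapped + 10 local registers unique to each window


def physical(cwp, r):
    """Map visible register r (0..31) in window cwp to a physical index."""
    if not 0 <= r < 32:
        raise ValueError("visible register number out of range")
    if r < 10:                        # globals, shared by all windows
        return r
    base = GLOBALS + (cwp % WINDOWS) * PER_WINDOW
    if r < 16:                        # shared with the callee window
        return base + (r - 10)
    if r < 26:                        # locals
        return base + 6 + (r - 16)
    # Shared with the caller: these are the caller's r10..r15.
    caller_base = GLOBALS + ((cwp - 1) % WINDOWS) * PER_WINDOW
    return caller_base + (r - 26)


# Number of distinct physical registers reachable from all windows.
total = len({physical(w, r) for w in range(WINDOWS) for r in range(32)})
print(total)
```

Under these assumptions the caller's r10 and the callee's r26 select the same physical register, which is the case the special decoder has to handle.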

In the pipelined execution of the instruction stream, data interdependencies among
instructions in the pipeline may arise. In the SPUR CPU, these interdependencies are detected
and resolved by hardware internal forwarding. That is, the results from preceding
instructions are forwarded to the following instructions by the hardware before being written
back to the register file, as indicated by the arrows in Figure 2-5. In the case of a 4-stage
pipeline like that of the SPUR CPU, data interdependencies may exist among 3 consecutive
instructions, since the write-back stage of the pipeline is delayed by two cycles after the
execution stage. The result available from each instruction’s execution stage, therefore,
needs to be stored in temporary registers for two cycles and then forwarded to the following
instructions. When both operands are registers, each register address is compared to the
destination register addresses of the two preceding instructions. This may result in double
internal forwarding, in that both operands are results of the two preceding instructions and
hence are supplied from the temporary registers.

Figure 2-7. Overlapped window register decoder
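
The operand selection implied by internal forwarding can be sketched behaviorally. Each source register address is compared with the destinations of the two preceding instructions, which gives the four comparisons for two operands. The function names and the data representation are hypothetical.

```python
def forward(src, regfile, prev1, prev2):
    """Select the value of source register src.

    prev1 and prev2 are (destination, result) pairs held in the temporary
    registers for the immediately preceding instruction and the one before
    it, or None if that instruction writes no register.
    """
    for stage in (prev1, prev2):      # the most recent result wins
        if stage is not None and stage[0] == src:
            return stage[1]
    return regfile[src]


def read_operands(rs1, rs2, regfile, prev1, prev2):
    """Two sources times two destinations: four address comparisons."""
    return (forward(rs1, regfile, prev1, prev2),
            forward(rs2, regfile, prev1, prev2))
```

When one operand matches `prev1` and the other matches `prev2`, both values come from the temporary registers, which corresponds to the double internal forwarding case.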

The hardware internal forwarding logic is in the critical path of the register file access,
and it must be implemented without slowing down the cycle time. Like decoding and
accessing the register array, it is also pipelined. Address comparisons are done in parallel
with the decoding of the register file, and internal forwardings are made if necessary while
the register file is accessed. Four address comparisons are necessary to detect all possible
data dependencies. The address comparator must be fast to keep the cycle time short, and it
must be compact to fit in the area between the register decoders and the temporary registers,
as seen in Figure 2-3 (block IF). Bitwise comparison is done using a dynamic XOR, shown in
Figure 2-8, and then the outputs are fed into the domino circuit for an address match. Since
this XOR does not require complementary inputs, the routing and area required are
significantly reduced. A special multiplexor, shown in Figure 2-8, is used to minimize the
signal delay through the internal forwarding logic that lies between the register file and the
functional unit. If internal forwarding is necessary, the bus from the register file is
disconnected by the transmission gate, and the bus to the functional unit is driven by the
temporary register. The access time of register file reading (14 nsec) includes the delay
through the internal forwarding logic.

B. The data path

The data path is divided into two parts: the upper data path for program counter logic
and special registers, and the lower data path for general computations on tagged registers.
Functional units in the lower data path include a byte-extractor, a byte-inserter, a simple
shifter that shifts up to three bits, and an ALU. The ALU provides XOR, OR, AND, ADD,
and SUBTRACT operations and comparison for two 32-bit operands. The upper data path
consists of a number of program counters to hold instruction addresses in the pipeline, an
address incrementer and adder, and special registers such as window pointers and processor
status words. All registers and counters are made of pseudo-static latches, such that each
register is refreshed once per cycle. This is necessary because an indefinite pipeline stall is
possible due to a long external cache miss.

In the SPUR CPU, compare-and-branch instructions are executed in only one cycle. A
separate adder in the upper data path calculates the target addresses for all the
compare-and-branch instructions while the ALU is in use for the comparison. Two different
adder designs are employed. The 32-bit ALU uses four 8-bit carry lookahead adders
implemented in domino logic, and evaluates the carry within 11 nsec. The 30-bit address
adder is more compact because it uses a Manchester carry chain, which has a carry
propagation delay of 13.5 nsec.
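
The organization of a 32-bit adder built from four 8-bit carry lookahead blocks can be illustrated with a behavioral model. Each block produces a group generate and a group propagate signal, and the carry into the next block is derived from them without waiting for the bit-by-bit ripple inside the block. This sketch models only the logic equations; it says nothing about the domino circuit implementation, and the function names are hypothetical.

```python
def block_gp(a, b, width=8):
    """Return (G, P), the group generate and propagate of one block."""
    G, P = 0, 1
    for i in range(width):            # from the least significant bit up
        g = (a >> i) & (b >> i) & 1   # bit generate
        p = ((a >> i) ^ (b >> i)) & 1 # bit propagate
        G = g | (p & G)
        P = P & p
    return G, P


def add32(a, b, carry_in=0):
    """Add two 32-bit values using four 8-bit blocks; return (sum, carry)."""
    total, carry = 0, carry_in
    for blk in range(4):
        x = (a >> (8 * blk)) & 0xFF
        y = (b >> (8 * blk)) & 0xFF
        total |= ((x + y + carry) & 0xFF) << (8 * blk)
        G, P = block_gp(x, y)
        carry = G | (P & carry)       # lookahead carry into the next block
    return total, carry
```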

The upper 8-bit slices of the lower data path are for tag-related operations. Operations
on the tag and the data are logically independent, that is, no information moves between the
two parts by carry propagation or any other implicit mechanism. For operations, the 6-bit tag
type is checked in parallel with the data operation. If there is a tag mismatch and the tag trap
enable bit is set in the user processor status word (UPSW), the CPU traps to software.
Generation tag checking (2 MSBs) is done when a special store instruction (ST_40 Rs1, Rs2,
Immediate) is executed. A generation tag exception may occur if the object (Rs2) with a
higher (younger) generation number is stored into the object (Rs1) with a lower generation
number [Ung84]. The read_tag and write_tag instructions move a tag to and from the data
portion of a register using the byte-extractor and the byte-inserter respectively, so that any
arithmetic or logical operations may be performed on it.
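
The tag and generation checks can be sketched on a 40-bit register value. The field widths (2-bit generation, 6-bit tag, 32-bit data) come from the text; the exact bit positions used below, with the generation number in the two most significant bits, and all function names are assumptions for illustration.

```python
GEN_SHIFT = 38   # assumed position of the 2-bit generation number
TAG_SHIFT = 32   # assumed position of the 6-bit tag type


def pack(generation, tag, data):
    """Build a 40-bit register value from its three fields."""
    return ((generation & 0x3) << GEN_SHIFT) | ((tag & 0x3F) << TAG_SHIFT) \
        | (data & 0xFFFFFFFF)


def unpack(reg40):
    """Return (generation, tag, data) of a 40-bit register value."""
    return ((reg40 >> GEN_SHIFT) & 0x3,
            (reg40 >> TAG_SHIFT) & 0x3F,
            reg40 & 0xFFFFFFFF)


def tag_trap(actual_tag, expected_tag, trap_enabled):
    """Trap only on a mismatch while the tag trap enable bit is set."""
    return trap_enabled and actual_tag != expected_tag


def generation_trap(stored, target):
    """Trap when a younger object is stored into an older one."""
    return unpack(stored)[0] > unpack(target)[0]
```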

To reduce the chip area and improve the circuit speed, the dynamic circuit technique
called domino logic [KLL82] is heavily used in the design. Potential charge sharing problems
are prevented either by the use of abundant clock phases or by careful layout of the critical
nodes. The SPUR CPU has 7 major busses to provide communications both externally and
internally. Some of these busses have high capacitive loadings, and hence precharging is used
to improve the speed of data flow through them. A high capacitance bus is precharged to high
before being used and discharged conditionally through a strong NMOS pull-down network
when used. This not only reduces the signal delay through the bus but also minimizes the
chip area required for a strong, large driver. Some logic functions may be included in the
pull-down network as well, further saving chip area. Critical paths of the data path, register
file, and instruction cache are summarized in Table 2-2.

C.  The  control 

Four-phase clocking and a uniform four-stage pipeline for all SPUR integer instructions
make the control section of the CPU relatively simple. The SPUR CPU uses internal
instructions to handle pipeline interrupts, rather than requiring complex sequences for those
exceptions. These internal instructions are miss, trap_call, and read_pc, which handle
instruction cache misses and all kinds of traps. These instructions are executed in the same
way other instructions are executed. The use of these instructions further simplifies the
control design.

phase    operation                                critical path (nsec)
phi 1    Register file - read                     14.0
phi 2    Instruction Cache - fetch                12.0
phi 3    ALU - 32b carry propagation              11.0
phi 4    Address adder - 30b carry propagation    13.5

Table 2-2. Critical path timing

The control can be divided into three parts: master control, trap logic, and the interface
to the cache control/memory management unit (CMU). The latter two are separated out from
the master control to simplify the control design. Trap logic detects all unusual conditions
during the pipelined execution of an instruction. All traps are taken during an instruction’s
third pipeline stage, and hence only one instruction can cause a trap in any cycle. The trap
logic consists of pipelined modules, each of which operates at the corresponding stage of the
instruction in the pipeline. The CMU interface logic generates cache opcodes according to
the current instruction and the status of the CPU. It also buffers signals to and from the CMU.

The block diagram of the master control is shown in Figure 2-9. A centralized master
control unit controls the processor sequencing and decodes the opcode into high level control
signals. Local random logic blocks then decode the high level signals into low level signals
using clocks. They also provide buffering of the low level signals according to the loading
requirement. All signals controlling the data path are individually optimized so as to have
equal delays relative to the clock edges. The separation of master control and local
decoding/buffering significantly reduces the amount of routing between the two sides,
particularly in CMOS design, where complementary signals are required in controlling the
data path.

Most of the control logic in the SPUR CPU is implemented in static PLAs. The largest
PLA is the one that decodes the opcode, which has 69 product terms with 40 outputs. The
propagation delay through this PLA is about 15 nsec, well below the required timing of two
phases or 50 nsec. All PLA outputs are evaluated once per cycle and need to be held in
registers until the next cycle. The routing between the PLA and the registers may consume
substantial chip area, since the PLA output pitch is so small compared to the pitch of the
registers. Thus, the registers (pseudo-static latches) are integrated into the output section of
the PLA by widening the PLA output pitch (16 lambda to 20 lambda). This results in an
unusually large PLA, but the chip area required is much less than if the PLA and the registers
were separated, and the timing requirement is still satisfied.

Figure 2-9. Block diagram of master control
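
The timing figures quoted in this section can be checked against the clock budget with a short calculation. The sketch assumes four equal phases, so that the "two phases or 50 nsec" quoted for the opcode PLA gives 25 nsec per phase, and it assumes that each critical path in Table 2-2 must fit within a single phase; neither assumption is stated explicitly in the text.

```python
PHASES_PER_CYCLE = 4
PHASE_NSEC = 50.0 / 2                 # the text equates two phases with 50 nsec
CYCLE_NSEC = PHASES_PER_CYCLE * PHASE_NSEC

# Critical paths from Table 2-2 (nsec).
CRITICAL_PATHS = {
    "phi 1": 14.0,   # register file read
    "phi 2": 12.0,   # instruction cache fetch
    "phi 3": 11.0,   # ALU 32b carry propagation
    "phi 4": 13.5,   # address adder 30b carry propagation
}

OPCODE_PLA_DELAY = 15.0               # nsec, from the text
OPCODE_PLA_BUDGET = 2 * PHASE_NSEC    # two phases


def phase_slack():
    """Return the unused time in each phase under the assumptions above."""
    return {phase: PHASE_NSEC - t for phase, t in CRITICAL_PATHS.items()}


print(CYCLE_NSEC)
print(phase_slack())
print(OPCODE_PLA_BUDGET - OPCODE_PLA_DELAY)
```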


2.4.  Design,  Verification,  and  Testing  Methodology 

Methodologies employed in the SPUR CPU design have been influenced by the following
two themes of the SPUR project: (1) overall system-wide rather than local optimization, and
(2) designing a chip for a working system rather than an experimental prototype.
Consequently, methodologies became very important, since the chip being designed must
meet all the functional requirements set for the system design as well as the performance
goals.

2.4.1. Design methodology

The design strategy incorporated both top-down and bottom-up approaches. The
top-down flow was as follows: architecture definition, instruction set design,
microarchitecture design, and a detailed functional/behavioral description of the hardware.
The bottom-up flow was circuit design of basic components, layout of basic cells, assembly
of major blocks using those cells, and global placement and interconnection. Both approaches
were taken in parallel from the beginning, to achieve the highest performance for the given
technology and system design goals. For instance, many microarchitecture decisions were
made after the feasibility of a certain hardware resource was carefully considered. Division
of design tasks followed the same hierarchical boundaries of design abstractions: architecture
and instruction set design, microarchitecture design, and VLSI implementation. One- or
two-person groups were formed to take responsibility for each design level. Close interaction
among the different groups was necessary to keep the interfaces between them, and the
design specifications, clean.

Most of the CAD tools used in designing the SPUR CPU chip were developed at
Berkeley, except those for the behavioral level design. The detailed design started with
describing the behavior of the chip and its interactions with other components within the
system. The functional behavior was written in ISP', a hardware description language, and
simulated using the N.2 simulator [N.2 Simulator.]. The implementation of the hardware can
be divided into two parts. Most parts of the control design were done using a set of CAD
tools that automatically synthesizes the behavioral description of the combinational logic
into PLAs [Seg87], [EEE88]. Other parts of the control logic (sequential) and the data paths
were designed manually but aided by another set of tools. These two paths are diagrammed
in Figure 2-10. For the automated synthesis path, only those parts of the hardware description
containing combinational logic can be synthesized. For the manual part, logic and circuit
design were done first for each block, followed by layout. Layout was done using an
interactive layout editor, Magic [Ous85], with background design rule checking and
hierarchical extraction. The extracted layout, which is a switch-level description of the chip,
was simulated using bdsim, a switch-level simulator [Seg87].

Timing analysis was done both before the layout, to make early tradeoffs among many
alternatives, and after the layout, to perform an exact timing analysis with all parasitics
correctly annotated. To estimate the critical paths of the chip more accurately, and thus to be
able to determine the cycle time, a test chip containing a register file with internal forwarding
was fabricated and tested [Lee86]. The measured critical path (register file read) was below
18 nsec, and this encouraged us to set the cycle time goal at 100 nsec.


Figure 2-10. Design methodology

2.4.2. Verification methodology

The verification methodology was constructed following a bottom-up approach. As
each individual module was designed, switch-level simulation was performed on the
extracted layout to verify the design. A small set of hand-written test vectors was used for the
simulation. Once individual modules were verified, they were connected and then simulated
together until the integration reached the major blocks, the execution unit and the instruction
unit. Test vectors up to this point were small and easy to generate by hand, since the test
sequences required to verify operations on these units separately were relatively simple.
After all major blocks were integrated, the verification effort was directed at both the
functional and switch levels.

Functional simulations were performed not only on each major component, to verify its
internal functions, but also at the external system level, to verify interactions among the
major chip sets. The diagnostics for the functional simulation were coded in SPUR
instructions, and an instruction level simulator called Barb was used to debug the diagnostics.
The diagnostics were intended to be stored in the start-up ROM on the processor board. The
N.2 system provided simulated memories that could be used to model ROM or other types of
memory. Therefore the diagnostics were assembled and loaded into the simulated memory.
When the N.2 simulation was started, it was forced to go through a series of start-up
sequences, making the CPU begin fetching instructions from the ROM containing the
diagnostics. The diagnostics were then executed to completion or until failure. The same
ROM image was used to program the EPROMs on the processor board, to be used for
on-board testing of the chip.

Running extensive simulations on the hardware description verified many design ideas
and functionalities, but it was still necessary to extract and simulate the layout of the entire
chip. The extracted description is almost guaranteed to accurately model the real chip.
However, developing the tests and examining the results for a complete switch-level
simulation would be very difficult. To minimize the required work, the functional simulation
should drive the switch-level simulation while automatically verifying that the two match at
every clock cycle. Fortunately, the N.2 simulator provides a "tracing" capability that logs all
changes to a specified set of signals into a file. By tracing all inputs and outputs of an N.2
module, it is possible to obtain a set of switch-level test vectors automatically. These vectors,
along with the expected results on output nodes, are fed into the switch-level simulation. The
switch-level simulator, bdsim, sets the input nodes according to the timing and vectors
specified and verifies the output nodes against the expected results. Any unusual condition is
recorded so as to be used in debugging.
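
The cycle-by-cycle comparison between the traced functional results and the switch-level outputs can be sketched as follows. The trace representation (one dictionary of signal values per cycle), the use of "X" for an unknown value, and the function name are assumptions for illustration; they do not reproduce the file formats of N.2 or bdsim.

```python
def compare_traces(expected, observed, unknown="X"):
    """Compare per-cycle output values and collect every mismatch.

    expected and observed are lists with one {signal: value} dictionary
    per clock cycle. An expected value equal to `unknown` is not checked.
    """
    mismatches = []
    for cycle, (want_row, got_row) in enumerate(zip(expected, observed)):
        for signal, want in want_row.items():
            got = got_row.get(signal, unknown)
            if want != unknown and got != want:
                mismatches.append((cycle, signal, want, got))
    return mismatches


expected = [{"out": "X"}, {"out": "1"}, {"out": "0"}]
observed = [{"out": "0"}, {"out": "1"}, {"out": "1"}]
print(compare_traces(expected, observed))
```

Recording every mismatch, rather than stopping at the first one, matches the use of the log for later debugging.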

A problem may arise because functional simulation and switch-level simulation may
show different results under unusual states, such as unknown and initial states. For example,
the functional simulator initializes all nodes to zero, while all nodes are initially set to
unknown in the switch-level simulation. When the chip is tested, neither of these initial
conditions is correct. To alleviate the problem, all internal states are initialized explicitly in
the functional simulations. In the switch-level simulation, on the other hand, the detailed
verifications are made after the initialization is done and all internal states are synchronized
with those of the functional simulation.

To have a working system rather than a prototype chip, all aspects of the design had to
be verified, especially the interfaces to external chips. Table 2-3 summarizes the diagnostic
vectors simulated in both functional and switch-level simulations.

2.4.3.  Testing  methodology 

Several features were incorporated into the SPUR CPU chip to increase its testability.
Passive scan registers are attached to all major busses to increase the observability. All
signals put on these busses can be scanned out for examination. All major blocks are
connected and communicate through these busses, so that the diagnostic capability is greatly
improved. Many signals, like the state bits of the finite-state machines in the IU and the LSBs
of the instruction address bus (busPC), are also routed out to pins to determine the exact
status of the processor at any time. The CPU sends out an instruction every cycle to the FPU
(via busI), and this also provides observability of the instruction being executed, including
internal instructions. The IU and the EU can be physically separated by setting certain
diagnostic pins. Furthermore, some of the lower order bits of the instruction address bus were
routed out to pins. Using these features, instructions can be delivered directly to the EU in
case the instruction unit is not functional, by monitoring the instruction address
(busPC<10:2>) available on pins.

diagnostics            test vector length (cycles)
CPU functions          13,113 (24%)
CMU interface          16,356 (29%)
FPU interface           1,543 (3%)
Lisp tags and traps     8,675 (16%)
Boot-up diagnostics    15,829 (28%)

Table 2-3. Diagnostics

The initial testing was done on a special board made for the SPUR CPU chip. The
Tektronix DAS 9100 system was connected to the board and controlled from a SUN
workstation. The test set-up is shown in Figure 2-11. The same vectors used in the
switch-level simulations were converted into test vectors. For short-cycle testing, test vectors
were downloaded to the DAS and testing was performed. A special set-up was necessary for
long-cycle testing, since the DAS can only hold up to 256 cycles of test vectors. Long vectors
were divided into several parts to fit in the DAS capacity. The division was made at
instructions accessing memory (the external cache), such that the CPU was deliberately made
to stall on a cache miss by controlling the CMU interface pin (cache busy) while the next
portion of the vector was being downloaded. All signals acquired during the testing were
transferred back to the SUN workstation for a cycle-by-cycle verification against the expected
result. Most of the CPU functionalities were tested using the initial test set-up. After the
debugging was done, the CPU chip was put on a SPUR processor board to test interactions
with other components on the board, especially with the CMU.
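
The division of a long test vector into parts that fit the 256-cycle capacity, with cuts placed only at memory-accessing instructions, can be sketched with a simple greedy procedure. The capacity comes from the text; the greedy strategy, the representation of allowed cut points as cycle indices, and the function name are assumptions and not a description of the actual test software.

```python
def split_vectors(n_cycles, split_points, capacity=256):
    """Split cycles [0, n_cycles) into chunks of at most `capacity` cycles.

    split_points lists the cycle indices at which a cut is allowed, i.e.
    where the CPU can be held in a stall while the next part is loaded.
    Returns a list of (start, end) pairs with end exclusive.
    """
    points = sorted(p for p in split_points if 0 < p < n_cycles)
    chunks, start = [], 0
    while n_cycles - start > capacity:
        candidates = [p for p in points if start < p <= start + capacity]
        if not candidates:
            raise ValueError("no allowed cut point within the capacity")
        end = candidates[-1]          # take the longest chunk that fits
        chunks.append((start, end))
        start = end
    chunks.append((start, n_cycles))
    return chunks


print(split_vectors(600, [100, 250, 300, 500]))
```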


Figure 2-11. Chip test set-up

2.4.4. Design metrics

The design metrics for the SPUR CPU are presented in Table 2-4. It provides the
approximate design time spent on both the circuit design and the layout, in man-months. The
total design time of the SPUR CPU is estimated at about 5 man years. This includes
behavioral modeling, VLSI design, verification, and testing. Some of these activities were
performed in parallel, and the times shown in Table 2-4 are for the VLSI design only.
Approximately 1 1/2 man years of the total development time was spent on verification and
testing of the chip. A total of 13 PLAs are used in the IU control and the master control.
These PLAs are summarized in Table 2-5.

The transistor count of the chip exceeds 115,000. More than 50% of the transistors, or
about 60,000 transistors, are in the SRAMs used to implement the register file and the
instruction cache, which occupy about 1/2 of the total active chip area. Area estimates
(percentages) shown do not include any routing region, so the numbers may not add up to the
totals. The regularity of each unit is computed by taking the ratio of total transistors to total
drawn transistors of each unit. A comparison of design metrics with other microprocessors is
presented in Table 2-6.
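
The SRAM transistor count and the regularity figures can be cross-checked with the array sizes given earlier in this chapter. The sketch counts only the 6T storage cells (register array, instruction cache data array, and tag array) and takes the total transistor count and the regularity values from Table 2-4; the variable names are illustrative.

```python
CELL_TRANSISTORS = 6                  # 6T SRAM cell

register_cells = 138 * 40 * CELL_TRANSISTORS     # 138 registers of 40 bits
cache_data_cells = 128 * 33 * CELL_TRANSISTORS   # 128 words of 33 bits
cache_tag_cells = 16 * 24 * CELL_TRANSISTORS     # 16 tags of 24 bits

sram_total = register_cells + cache_data_cells + cache_tag_cells
chip_total = 115214                   # total transistors, Table 2-4

print(register_cells)                 # matches the Registers row of Table 2-4
print(sram_total, round(sram_total / chip_total, 3))

# Regularity = total transistors / drawn transistors, so the number of
# drawn transistors follows from the two table entries.
print(register_cells / 5520.0)        # register array: one drawn 6T cell
print(round(chip_total / 12.6))       # whole chip
```

The cell count of about 60,800 transistors is a little over half of the chip total, in line with the figures quoted above.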

2.4.5.  Results 

The first-pass silicon had a few bugs, including circuit design, layout, and timing errors,
but it worked well enough to be used for initial debugging of the processor board. The layout
errors discovered were well and substrate contacts misplaced onto signals rather than power
supply lines. These effectively shorted the signal to either the ground or the power line,
resulting in a stuck-at type fault. Some of these errors were corrected by isolating the
misplaced contacts from the power supply using the laser restructuring technique provided by
the Information Sciences Institute (ISI). Either the first level or the second level metal can be
disconnected by using a laser shot through the passivation layer. The second (topmost) level
metal lines with a width of 3 µm were cut successfully without affecting other structures
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Block 


Layout  Area 

Height  Width  %  Area 
(lambda)  (lambda) _ 


Instruction  Unit 

4456 

143% 

lU.CTR 

1508 

1.9% 

IB_Cache 

4267 

4899 

17.4% 

IB_TAG 

2423 

1214 

2.5% 

Register  File 

4412 

5478 

20.2% 

Registers 

3006 

4533 

11.4% 

Decoders 

1089 

4656 

4.2% 

IF  Logic 

221 

779 

0.1% 

DSTl  &  DST2 

3100 

930 

2.4% 

Master  Control 

3522 

3422 

SEQUENCER 

2424 

3388 

6.9% 

TRAP.LOGIC 

645 

586 

0.3% 

CC_INT 

877 

534 

0.4% 

SPD_LOGIC 

223 

490 

0.1% 

Local  Control 

1.0% 

RegFile_CTR 

835 

212 

0.2% 

Func_CTR 

136 

500 

PCLOGIC.CTR 

290 

2110 

SpecReg_CTR 

279 

811 

0.2% 

Special  Registers 

2530 

3.6% 

UPSW 

2525 

0.5% 

KPSW 

2536 

1.1% 

CWP  &  SWP 

2755 

846 

1.9% 

Functional  Units 

1897 

5.0% 

Byte-Extractor 

248 

0.6% 

Byte-Inserter 

262 

0.7% 

Shifter 

3160 

336 

0.9% 

ALU 

3166 

1078 

2.8% 

PC  Logic 

2756 

2098 

4.8% 

Miscellaneous 

MBR 

3093 

482 

1.2% 

MAL 

1125 

711 

0.7% 

Scan_Registers 

3200 

1600 

4.3% 

busjnterface  . 

3154 

584 

1.5% 

CLK_GEN 

455 

2127 

0.8% 

PADS  &  OUiers 

Total 

10140 

11820 

100.0% 

Table  2-4.  Design  metri 
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( transistors 

Design  Time 

Regularity 

Circuits 
(Man  month) 

Layout 
(Man  month) 

37622 

17.8 

1.0 

2.0 

mmm 

2.0 

0.6 

1.0 

31583 

03 

0.7 

4538 

0.1 

0.3 

42924 

ns 

mSm 

3S 

33120 

5520.0 

1.5 

6300 

9.0 

0.5 

210 

2.1 

05 

0.5 

3294 

11.0 

03 

0.5 

3849 

1.7 

13 

2.4 

2190 

5.0 

05 

1070 

1.2 

506 

PLA 

0.2 

0.3 

83 

PLA 

0.1 

0.1 

721 

1.0 

0.4 

03 

137 

1.0 

0.1 

0.2 

46 

1.0 

0.1 

0.1 

372 

1.0 

0.1 

0.1 

166 

1.0 

0.1 

_ _ 

3502 

WBM 

03 

905 

0.1 

1020 

■SI 

0.1 

0.2 

1577 

■■ 

0.1 

0.2 

5619 

0.8 

MSm 

209 

0.1 

395 

0.1 

768 

0.1 

0.3 

4247 

03 

1.0 

6370 

43 

2.0 

4.0 

1619 

11.0 

0.1 

0.2 

748 

30.0 

6028 

36.0 

0.3 

1581 

17.8 

0.5 

392 

1.0 

1.0 

1.0 

4239 

0.5 

0.5 

115214 

12.6 

9.6 

17.4 

of  the  SPUR  CPU 


PLA           # product terms   # outputs   # inputs   Power (mW)

OPCODE                      68          40          8         54.0
FAST_LOGIC                  16          14         18         15.0
SPD_LOGIC                    7           2          6          4.5
TRAP_ENABLE                 15          13         24         14.0
TRAP_TYPE                   11           9         11         10.0
CC_OPGEN                    25           6         13         15.5
CC_INT                      17           5         12         11.0
IU_CTR_P1                    6           4          7          5.0
IU_CTR_P2                   10           5          9          7.5
IU_CTR_P3                   30           8         10         19.0
IU_CTR_P4                   21           6         16         13.5
IU_FET_FSM                  14           3         10          8.5
IU_PF_FSM                   14           3         10          8.5
Total                      254         118        157        186.0

Table 2-5. SPUR CPU PLAs


CPU           # transistors   Regularity   Design time
                    (1000s)                (man years)

SPUR CPU               115K         12.6           5.0
SOAR                    36K          8.3           3.2
RISC II                 41K         20.0           2.5
M68000                  68K         12.0          14.2
80386                  181K           NA          50.0

Table 2-6. Comparison of design metrics


nearby. Other problems found were timing errors and glitches on signals controlling the
dynamic circuits. The glitches were caused by excessive ringing on the clock lines. Long
clock lines (10 mm) can have parasitic inductance and capacitance large enough to cause
substantial ringing, which may trigger hazardous glitches.

Several electrical-rule checks were performed to avoid repeating the same errors in the
second pass. However, another layout error was discovered after fabrication: a portion of a
metal wire was missing, leading to a disconnected signal. A focused ion beam (FIB) IC
development system, provided by Seiko Instruments, was used to fix the problem. Using the
ion beam, two holes were drilled through the passivation layer to reach the separated metal
lines, and the two points were then connected by FIB-CVD (chemical vapor deposition) metal
film deposition. The revised and repaired chip is fully functional and is used in a working
SPUR processor board that successfully executes its own operating system (Sprite) as well as
many applications, including LISP programs. The nominal operating frequency of the chip on
the processor board is 10 MHz, while the maximum operating frequency is 12.5 MHz (80 nsec
cycle time).

2.5. Summary

The SPUR CPU is a single-chip RISC microprocessor designed for a multiprocessor
workstation. It supports a multilevel cache scheme including a prefetching on-chip
instruction cache, a coprocessor interface, and support for the fast execution of LISP through
a tagged 40-bit architecture. In order to build a working computer system based on the SPUR
CPU chip, reliable and efficient methodologies were necessary throughout the design. The
chip, fabricated in a 1.6 µm double-metal CMOS process, works well in the multiprocessor
system prototype, and it met both the functional and the performance goals set at the initial
stage of the design. It runs at 10 MHz consistently for all programs and dissipates less than
0.8 W of power.


On-chip Memory Design


3.1.  Introduction 

A fundamental limit on microprocessor performance is set by the ratio of the amount of
memory traffic to the available i/o pin bandwidth of the microprocessor chip. The
microprocessor’s memory traffic consists of instruction and data transfers into and out of
the chip. As the cycle time of the microprocessor shortens with advances in integrated
circuit technology, the memory traffic required to balance the overall system throughput goes
up rapidly [Kun86]. However, due to various limitations, the i/o pin bandwidth remains
relatively constant. The minimum pad size required for wire bonding has been unchanged for
years. Moreover, the number of power supply pins required has risen to accommodate
fast-switching i/o pins, which in turn has reduced the number of pins available for off-chip
communication.

To obtain the highest possible performance in a single-chip microprocessor architecture,
off-chip communication must be minimized while integration of functionality is maximized.
One way of minimizing off-chip communication is to include local memory on the chip, as a
cache or a set of registers, to hold frequently used instructions or data. Cache memories not
only provide fast access to instructions and data but also reduce off-chip memory accesses
by using the cached instructions and data repeatedly. Therefore, the on-chip memory design
has a great impact on a microprocessor’s performance and is becoming increasingly important.

In this chapter, I examine on-chip memory design issues and present new and efficient
on-chip memory designs for microprocessors. The focus is on tradeoffs between the
architectural design and the circuit design. Circuit design techniques, when properly adapted
to the architectural design, can easily provide cost-effective performance improvement. Two
key areas of interest in this research are the use of DRAMs as a cache on a microprocessor
chip and multi-port memory design to support parallelism among multiple functional units.
With high-density dynamic memories, cache performance can be improved greatly, since the
storage capacity (size) of the memory is one of the most critical parameters in on-chip cache
design. Multi-port memory can have a great impact on processor performance since it
provides high bandwidth and facilitates concurrent operations. The results of the research
can serve as a design guide for alternatives in future microprocessor memory designs.

This chapter is organized as follows: Section 2 reviews the use of on-chip memories in
existing microprocessors, as an instruction store and as a data store separately. Section 3
presents a reliable and efficient way of using dynamic memory elements in a microprocessor
chip. Section 4 presents the local on-chip memory design for multiple functional units. To
satisfy the bandwidth requirement of multiple functional units, multiple memories or a
multi-port memory design is necessary. The implementation issues of these memories are
considered in that section. Section 5 summarizes the on-chip memory designs.

3.2. On-chip Memories in Microprocessors

Microprocessor  architecture  is  evolving  as  silicon  integrated  circuits  increase  in  density. 
On-chip  memories  are  becoming  an  established  feature  in  single-chip  microprocessor  designs 
because  they  significantly  improve  performance.  It  is  particularly  important  for  single  chip 
RISC  microprocessors  to  include  large,  high-speed  memories,  because  RISC  chips  must 
reduce  off-chip  memory  delays  to  achieve  the  shortest  possible  cycle  time.  The  organization 
of  the  on-chip  memory  is  therefore  very  important  in  the  design  of  high  performance  VLSI 
single-chip  processors. 

Memories are used in various forms on microprocessor chips. Fast storage for instructions
and fast storage for data are two distinct needs for on-chip local memories. The separation
of local memories for instructions and data is common, in part to increase effective memory
bandwidth. A mixed instruction and data cache is not as effective as separate caches unless
dual-ported memory is used to resolve the contention between instruction and data memory
references. When on-chip memory is limited in capacity, whether to use it as an instruction
cache or as a data cache can be an interesting architectural tradeoff. This section begins
with an examination of the use of local memories in existing microprocessors; the focus and
limitations of the research are then identified.

3.2.1.  Local  memory  for  instruction  store 

Microprocessor performance can be hampered by off-chip memory access delays. These delays
are caused either by fundamental limitations in off-chip communication or by i/o contention
between instruction and data memory traffic through scarce i/o pins. On-chip instruction
caches resolve these problems by caching instructions on the chip and supplying them
directly to the execution unit. This allows the i/o pins to be used primarily for data
memory accesses, and thus effectively provides dual-ported access to the external memory.

Instruction caches are simpler to design than mixed caches because the cache is read-only
(the cache is written only to replace a missed block). Furthermore, since instructions show
a much higher degree of locality than data, even a small cache can improve processor
performance significantly. Many existing microprocessors incorporate on-chip instruction
caches in one form or another. The different instruction memory organizations are
summarized below.

Prefetch buffer (PB) holds instructions sequentially ahead of the current program counter
in the instruction stream. PBs are usually organized as a FIFO of instruction words. Many
computers, such as the IBM System/370 Model 158 and the DEC VAX 11/780, have had PBs.
Today’s microprocessors have PBs implemented as a part of other types of instruction cache.

Instruction buffer (IB) uses caching and prefetching to reduce effective access delay as
well as memory traffic. As a conventional cache, an IB can be organized as a direct-mapped
or a set-associative cache. For small IBs, however, it has been shown that a direct-mapped
cache performs comparably to a fully associative cache with LRU replacement [SmG83]. Loading
partial blocks upon IB misses (sub-block placement) is also effective in minimizing memory
traffic [Goo83], [Hil87b]. Important parameters in designing an on-chip IB include cache hit
and miss times, cache size, and the aspect ratio of the cache memory when it is actually laid
out inside the chip [ACH87]. Microprocessors with an on-chip IB include the Motorola 68020
[MMM84] and 68030 [MMM86], the National NS32532, the MIPS-X [Hor87] at Stanford, and the
SPUR CPU [Hil86] at U.C. Berkeley.
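
The direct-mapped organization with sub-block placement described above can be sketched in
a few lines of Python. This is a minimal illustration only, not the SPUR design: the class
name, the word addressing, the one-word sub-block, and the sizes are all assumptions chosen
for brevity.

```python
class InstructionBuffer:
    """Direct-mapped instruction buffer with sub-block placement (illustrative).

    Assumptions (not from the text): word addressing, one word per sub-block,
    and the default sizes below.
    """

    def __init__(self, num_blocks=16, subblocks_per_block=4):
        self.num_blocks = num_blocks
        self.subblocks_per_block = subblocks_per_block
        self.tags = [None] * num_blocks
        # One VALID bit per sub-block, so a block may be only partially filled.
        self.valid = [[False] * subblocks_per_block for _ in range(num_blocks)]
        self.subblocks_fetched = 0  # off-chip traffic, counted in sub-blocks

    def access(self, word_addr):
        """Return True on a hit; on a miss, load only the missing sub-block."""
        sub = word_addr % self.subblocks_per_block
        block_addr = word_addr // self.subblocks_per_block
        index = block_addr % self.num_blocks
        tag = block_addr // self.num_blocks

        if self.tags[index] != tag:
            # Tag mismatch: reallocate the block and invalidate all its sub-blocks.
            self.tags[index] = tag
            self.valid[index] = [False] * self.subblocks_per_block

        if self.valid[index][sub]:
            return True

        self.valid[index][sub] = True
        self.subblocks_fetched += 1
        return False


ib = InstructionBuffer(num_blocks=2, subblocks_per_block=4)
trace = [0, 1, 0, 1, 8, 0]  # word addresses 0 and 8 map to the same block frame
hits = [ib.access(addr) for addr in trace]
```

In this trace only the sub-blocks actually referenced are fetched, which is the traffic
saving that sub-block placement is meant to provide; the conflict between addresses 0 and 8
shows the cost of direct mapping.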

Target instruction buffer (TIB) reduces effective instruction access time by caching
instructions at branch targets or at the beginning of an instruction run. TIBs are usually
implemented together with PBs. Upon a non-sequential instruction fetch (i.e. a branch), the
TIB is accessed to provide (on a TIB hit) the next instruction, which is the first
instruction of the next instruction run. Subsequent sequential instruction fetches are
handled by the PB. The AMD Am29000 [AAA87] RISC microprocessor uses a TIB with a PB.

Branch target buffer (BTB) buffers the addresses of previous branches and their target
addresses [LeS84]. A BTB is used to reduce the pipeline bubbles that result from waiting for
the next instruction address to be determined. The instruction fetch address is compared
with the contents of the BTB and, if they match, the next instruction address is taken from
the BTB. The performance of the BTB depends on the choice of branch prediction algorithm,
the size, and the organization (e.g. set-associativity).
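
The BTB lookup described above can be sketched as follows. This is a hedged illustration
under stated assumptions: a direct-mapped organization, word-sized instructions, and the
simplest possible prediction policy (predict taken whenever the branch is present in the
buffer). None of these choices is taken from a particular processor.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch; the policy and organization are assumptions."""

    def __init__(self, num_entries=8):
        self.num_entries = num_entries
        # Each entry holds (branch_addr, target_addr) or None when empty.
        self.entries = [None] * num_entries

    def next_fetch_addr(self, fetch_addr):
        """Compare the fetch address with the BTB; on a match use the stored target."""
        entry = self.entries[fetch_addr % self.num_entries]
        if entry is not None and entry[0] == fetch_addr:
            return entry[1]
        return fetch_addr + 1  # no match: fall through to the next sequential word

    def update(self, branch_addr, taken, target_addr):
        """Record the outcome of a resolved branch."""
        slot = branch_addr % self.num_entries
        if taken:
            self.entries[slot] = (branch_addr, target_addr)
        elif self.entries[slot] is not None and self.entries[slot][0] == branch_addr:
            self.entries[slot] = None  # branch fell through: stop predicting it taken


btb = BranchTargetBuffer(num_entries=4)
before = btb.next_fetch_addr(10)   # unknown branch: sequential fetch
btb.update(10, taken=True, target_addr=3)
after = btb.next_fetch_addr(10)    # known branch: fetch redirected to the target
other = btb.next_fetch_addr(14)    # same slot, different address: no match
```

A different prediction algorithm, size, or associativity would change only `update` and the
indexing, which is where the performance dependence noted above enters.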

Decoded instruction buffer holds fully decoded instructions, so that an instruction issued
from the instruction cache can be executed without any further decoding delay. The CRISP
microprocessor [DMB87] uses this form of instruction cache with branch folding. When a
non-branching instruction is immediately followed by a branch, the two are folded together
to form a single new decoded instruction.

The performance of on-chip instruction memory is characterized by the effective access
delay of instructions over time. Cache access time and miss handling time are as important
as the cache hit/miss rate, since together they determine the effective instruction access
delay [Hil87b]. The physical size and aspect ratio of the on-chip instruction memory are
also important because it must fit within the area available [ACH87]. For a given silicon
area, the shortest effective access delay is achieved when the density of the memory is
maximized while its access time is kept at its minimum.
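
The interplay of access time, miss handling time, and miss rate can be made concrete with
the usual first-order model, effective delay = hit time + miss ratio x miss penalty. The
model and the numbers below are illustrative assumptions, not figures from this work.

```python
def effective_access_delay(hit_time, miss_ratio, miss_penalty):
    """First-order average access delay, in the same time unit as the inputs."""
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must lie in [0, 1]")
    return hit_time + miss_ratio * miss_penalty


# Hypothetical comparison in processor cycles: a small, fast cache against a
# denser cache that is slower to access but misses far less often.
small_fast = effective_access_delay(hit_time=1.0, miss_ratio=0.20, miss_penalty=8.0)
large_slow = effective_access_delay(hit_time=1.5, miss_ratio=0.05, miss_penalty=8.0)
```

Under these assumed numbers the larger cache wins despite its longer hit time, which is why
density and access time have to be weighed together for a fixed silicon area.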

3.2.2.  Local  memory  for  data  store 

In general, there are two ways to organize the local memory for data: as a conventional
cache and as registers. The referencing behavior of data memory is somewhat different from
that of instruction memory, and the memory for data can be controlled to some extent by the
programmer [McD88]. Goodman [GoH86] showed that with a small on-chip memory, registers can
be more effective than a cache in reducing access delays if an optimal register allocation
algorithm is used when compiling the program. Registers and data caches in microprocessors
are organized in various ways to take advantage of the different data access behaviors of
programs.

Conventional cache, although usually invisible to the programmer, works consistently well
and takes account of dynamic program behavior. An important parameter in designing a small
on-chip data cache is the transfer size of data from memory to the cache, or line size (the
line is also referred to as a sub-block when the transfer size is smaller than a cache
block). For a given cache size, a smaller line (sub-block) has been shown to be more
effective than a larger line for the data cache, due to temporal locality [GoH86][Smi82]. A
smaller line size also minimizes the off-chip memory bandwidth requirement.

Register file organizes registers in either a single set or multiple sets. Single-set,
general purpose registers have been widely used in microprocessors. Efficient use of on-chip
registers depends on a well-adapted register allocation scheme [Rad82][Hen81]. Multiple
register sets improve the processor’s performance by reducing the off-chip memory traffic
required to save and restore registers upon a call or context switch. The large register
file of RISC II [Kat83] at U.C. Berkeley, organized as a stack of register sets, allocates
new register sets dynamically on a per-procedure basis.

Stack cache caches only memory references to the stack. It operates just like a
conventional cache except that the stack pointer is used in managing the cache. When a miss
occurs and the word to be replaced is dirty, it is written back to memory only if its
location is below the top of the stack.
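
The write-back rule of the stack cache can be stated as a one-line predicate. The sketch
below assumes a stack that grows toward higher addresses, with the stack pointer addressing
the top element, so that live words lie at or below the stack pointer; the function name and
this growth direction are assumptions made for illustration.

```python
def needs_write_back(dirty, addr, stack_pointer):
    """Decide whether a replaced stack-cache word must be written to memory.

    Assumes the stack grows toward higher addresses, so any word above
    stack_pointer has already been popped and its value is dead.
    """
    return dirty and addr <= stack_pointer


live_dirty = needs_write_back(dirty=True, addr=90, stack_pointer=100)
dead_dirty = needs_write_back(dirty=True, addr=120, stack_pointer=100)
live_clean = needs_write_back(dirty=False, addr=90, stack_pointer=100)
```

Only the first case costs an off-chip write; the dirty word above the top of the stack is
simply dropped, which is the saving a stack cache has over a conventional cache.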

Top of stack cache is a set of high-speed registers that holds the frequently used entries
at the top of the stack. It takes advantage of the fact that stack references generally
occur near the top of the stack, rather than being scattered as in a data cache. The
management of the TOS registers is as important as register allocation in microprocessors
with a general purpose register set. This type of cache has been used in the C machine
[DiM82] at Bell Labs and the Dragon [McC84] at Xerox PARC.

The silicon area used to hold a byte of data in a cache differs from that used to hold a
byte in a register. A cache requires tags, valid and dirty bits, and replacement information
so that it can be managed dynamically by the hardware. Registers, on the other hand, must be
managed efficiently by the software, and often require multiple access capability to provide
high bandwidth between the execution unit and the register file. To use local memory most
efficiently, implementation tradeoffs, such as speed versus power or multi-port versus
multiple sets of memories, must be carefully examined.

3.2.3.  The  focus  and  limitations  of  the  research 

In the previous two sections, we briefly examined the use of on-chip local memories in
many existing processors. Two key observations drawn from that review are summarized below.
The research presented in this chapter is based on these two observations.

(1) Since the on-chip memory is limited in size, many different, complex cache and
register organizations are used for various optimizations. It is certain that an increase in
memory size will not only improve the overall performance but also simplify the on-chip
memory design.

(2) The clock rate of a microprocessor’s execution unit is increasing rapidly as IC
technology advances; hence the bandwidth of the local memory must be sufficient to supply
data at the rate the execution unit demands. Furthermore, the use of multiple functional
units to increase system throughput by exploiting parallelism in hardware is becoming
common. This, in turn, adds to the bandwidth requirement of the local memory. Consequently,
a fast multi-port memory supporting multiple simultaneous read and write accesses may be
necessary.

Silicon real estate is one of the scarce resources on a single-chip microprocessor.
Therefore, local memory must be used efficiently, and memory density must be maximized for a
given silicon area. Traditionally, mainly due to reliability concerns, only static random
access memories (SRAMs) have been used on most microprocessor chips. Dynamic random access
memories (DRAMs) offer higher bit density for a given silicon area than SRAMs. However, due
to fundamental limitations associated with DRAMs, such as the refreshing requirement and
complex self-timed controls, DRAMs have rarely been utilized as on-chip memories. In the
following section, I propose techniques for a reliable and efficient use of DRAMs as on-chip
cache memories. The focus will be on the instruction cache. Trace-driven cache simulations
are used to analyze the newly proposed schemes. The same techniques can be applied to the
data cache under certain restrictions, which will also be discussed.

Increasing a processor’s performance by using multiple functional units requires multiple
local memories or multi-port memories to match the bandwidth those units require.
Multi-port memories are, however, much more expensive than single-ported memories in terms
of silicon area and operating speed. A micro-architect must examine all possible memory
designs in order to build a high performance microprocessor. Within this research I examine
a few alternatives for multi-port memory design, such as a dual-ported memory based on 6T
SRAM cells and the extension of that design to multi-port memories with more than two
read/write ports. The goal is to provide a guideline for making the right tradeoffs in
multi-port memory design.

3.3.  On-chip  DRAM  caches 

3.3.1.  Why  DRAMs? 

Static memories have been popular for on-chip memories because they do not require
periodic refreshing or complex control circuitry. Dynamic memories must be refreshed
periodically, before the dynamic charge storage node loses its voltage level through the
leakage current that is inevitable in silicon technology. The refreshing requirement of
dynamic memories may interfere with the processor’s normal operations, and thus they have
rarely been used in single-chip microprocessors. However, the density that single-transistor
(1T) or 3-transistor (3T) dynamic memory offers now stimulates designers to consider using
dynamic memory as on-chip memory.

The SRAM cell used in microprocessors, typically a 6T SRAM cell or one of its variants, is
about four to eight times larger in area than some (3T or 1T) dynamic cells. For example,
the 6T SRAM cell used in the SPUR CPU to implement the instruction cache and a 3T dynamic
memory cell are compared in Figure 3-1 (each layout contains four bits sharing power lines
and bit line contacts). As discussed in the previous section, the size of the local memory
is one of the most important design parameters. Therefore, it is quite appealing to use
dynamic memory in place of static memory where an increase in local memory size is crucial.
With dynamic memories, such as 3T DRAMs, the size of the local memory in a given chip area
can easily be quadrupled or increased even more. The 1T and 4T DRAM cells are not as useful
as the 3T cell for a single-chip microprocessor, since they may require a special
fabrication process (1T) or a ratioed design (4T). Ratioing transistors may result in a
large cell area.

As we replace static memory with dynamic memory, more memory cells are integrated into
the same area. In order to increase the overall performance, however, the speed or access
time of the memory array must remain relatively constant over this change. It is the density
(and hence the logical size of the memory) that increases with dynamic memory, not the
physical size or area of the memory, and the parasitics that dominate memory access delays
depend on the physical size. Therefore, with careful layout of the cell and good circuit
design techniques, dynamic memory integrated in a given area can be as fast (especially
with the 3T dynamic cell) as static memory integrated in the same area, but with higher
density.

3.3.2.  Limitations  of  DRAMs  for  an  on-chip  memory 

On-chip use of DRAM has serious drawbacks due to the difficulty of implementation in a
standard process technology and to reliability issues, such as the refreshing requirement
and the hard and soft errors of DRAM (see below for further explanation). The three most
common types of DRAM cells are the 1-transistor, 3-transistor, and 4-transistor cells, as
shown in Figure 3-2. High-density, state-of-the-art DRAMs use 1T DRAM cells. The process
technology for such high-density DRAMs is quite different from the process technology in
which microprocessors are fabricated. Furthermore, the access time of a 1T DRAM is usually
much longer than that of 3T or 4T DRAMs because of its slow sensing. The 3T and 4T dynamic
cells do not require a special process technology and can easily be integrated into a
single-chip microprocessor. Moreover, they are more tolerant of process variations than 1T
cells. (High-density SRAM, with 4T cells and 2R loads, often uses a special fabrication
process to reduce the cell area by using load devices made of high-resistance poly
resistors.)

Figure 3-1. Comparison of CMOS 6T SRAM cell (of SPUR CPU) and 3T DRAM cell

Dynamic memory stores information as a charge on an isolated capacitive node. The charge
on this node leaks away if the node is left isolated for a long time, due to the leakage
current associated with the necessary silicon pn junctions. The information stored on a
storage node may be lost if too much charge leaks away. A refreshing operation reads the
information before it is degraded and restores the charge to its original level. The
refresh interval of dynamic memory can be determined by:

$$T_{ref} = \frac{f \cdot Q_{dynamic\ node}}{I_{leakage}}$$

where f is the fraction of the stored charge allowed to be lost due to the leakage current.

With current technology, a dynamic storage node must be refreshed as often as once every
one to two milliseconds. As the capacitance of the storage node decreases, the refresh
interval must be shortened. In the following section, I present a method to overcome the
refreshing overhead of dynamic memory.
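
The refresh-interval relation above can be evaluated numerically. In the sketch below the
stored charge is taken as capacitance times voltage, and every number (50 fF, 5 V, 50 pA,
20% tolerated loss) is an illustrative assumption chosen only to land in the millisecond
range mentioned in the text, not a measurement from this work.

```python
def refresh_interval(fraction_lost, node_capacitance, node_voltage, leakage_current):
    """Refresh interval in seconds: T_ref = f * Q / I_leakage, with Q = C * V."""
    stored_charge = node_capacitance * node_voltage  # coulombs
    return fraction_lost * stored_charge / leakage_current


# Hypothetical storage node: 50 fF charged to 5 V, leaking 50 pA, 20% loss tolerated.
t_ref = refresh_interval(0.2, 50e-15, 5.0, 50e-12)

# Halving the storage capacitance halves the interval, as the text notes.
t_ref_small_node = refresh_interval(0.2, 25e-15, 5.0, 50e-12)
```

With these assumed values the interval comes out at about one millisecond, and it scales
linearly with the node capacitance.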

Hard errors usually originate from fabrication defects. Hard errors are therefore more
probable when a larger chip area is devoted to the on-chip memory. Soft errors are induced
by alpha particles or cosmic rays and are a well-known phenomenon in the use of DRAM. To
make dynamic memory on a microprocessor chip safe and efficient, these problems must be
overcome. Error detection and correction codes are used extensively to improve the
reliability of dynamic memory systems by handling hard and soft errors. Recently, some
single-chip DRAMs have integrated these error detection and correction schemes on the chip
[Yam84][Man87]. It may be desirable to have a simple form of these schemes in
microprocessors with dynamic memory.


3.3.3.  Non-refreshing  DRAMs  for  on-chip  caches 

The integrity of the data stored in dynamic memory can only be assured by periodic
refreshing. Ideally, refreshing should be done without affecting the processor’s execution
stream. Some microprocessors use software refreshes, sometimes called refresh hiccups. When
a timer interrupt occurs indicating a refresh interval, the microprocessor’s control stops
the on-going operation to refresh the dynamic memory. This scheme, however, is unacceptable
because of the effort required to make sure that all systems using the processor have the
proper interrupt handler. It also degrades the processor’s performance by interfering with
the processor’s normal execution stream and by taking as many cycles as are needed to
refresh the dynamic memory. Simple circuit techniques and a few modifications to the cache
design can effectively alleviate these problems. Two schemes that eliminate the refreshing
overhead of dynamic memory have been devised for implementation in hardware. These are:
(1) invalidate on every refresh interval; and (2) selective invalidation on every refresh
interval. These schemes are based on the following assumptions and restrictions [Hil87a]:

(1) The cache contains copies of instructions or data that also reside elsewhere, such as
in an external cache or main memory (e.g. an instruction cache or a write-through data
cache).

(2) The cache contains a number of blocks, each consisting of an address tag and one or
more sub-blocks; associated with each sub-block is a VALID bit, so that any subset of a
block’s sub-blocks may be valid.

(3) The sub-block is the unit of transfer from off-chip into the on-chip cache.

(4) All VALID bits associated with address tags or sub-blocks can be reset in parallel to
invalidate the cache.

(5) Any access to dynamic memory is considered a refresh (a read is always followed by a
write-back in DRAMs). In other words, a cache entry accessed during the last refresh
interval need not be refreshed until the end of the next interval.

DRAMs, if used as a cache on a microprocessor chip under the above assumptions, need not
be refreshed periodically. Instead, the cache may be invalidated once every refresh
interval. Most microprocessors with on-chip caches have a privileged instruction or some
other way to invalidate their caches (see assumption 4 above). Thus, at the expense of a
timer (frequency/clock counter), one can easily implement this scheme. This scheme, however,
degrades the processor’s performance by invalidating the active cache entirely every few
milliseconds. This invalidation of the cache at the end of every refresh interval is
referred to as the first scheme, invalidate on every refresh interval. The invalidation
interval for this scheme can be as long as the refresh period of the dynamic memory.

The next scheme, selective invalidation, is more elaborate than the first. An extra bit
per sub-block, in addition to the VALID bit, is used to hold a refreshing status. This bit
is set at the beginning of each refresh interval, and reset selectively whenever the
corresponding sub-block is accessed, regardless of whether the access is a read or a write.
At the end of the refresh period, all of these bits and the VALID bits are examined, and only
those sub-blocks that were not accessed during the last interval and are still valid are
invalidated selectively (see Figure 3-3). Invalidating the entire cache to preserve data
integrity is no longer necessary. The fact that cache entries not accessed for a long time
may not be needed in the future (cf. temporal locality) makes the DRAM cache performance
very close to that of the SRAM cache. The performance of the DRAM cache with selective
invalidation is evaluated in the following section using trace-driven cache simulations.
Selective invalidation can be implemented with a small circuit (six transistors per
sub-block), as shown in Figure 3-4. It can easily be extended to incorporate separate or
multiple word lines if required, by adding one transistor per word line (see the dotted
transistor in the figure).

(a) Cache state at the beginning of the refresh interval
(b) Cache state at the end of the refresh interval with selective invalidation

Figure 3-3. Dynamic RAM cache - selective invalidation
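The bookkeeping behind the two schemes can be sketched in a few lines of Python. This is
only an illustrative model under stated assumptions, not the circuit of Figure 3-4: the class
name `DramCacheModel`, its method names, and the list-of-booleans representation are all
hypothetical, and tags, blocks, and timing are ignored so that only the VALID and
refresh-status bits remain.

```python
class DramCacheModel:
    """Toy model of the per-sub-block VALID and refresh-status bits (hypothetical names)."""

    def __init__(self, num_sub_blocks):
        self.valid = [False] * num_sub_blocks
        # True means "not accessed since the current refresh interval began".
        self.needs_refresh = [False] * num_sub_blocks

    def begin_interval(self):
        # The refresh-status bit of every sub-block is set when an interval starts.
        self.needs_refresh = [True] * len(self.valid)

    def access(self, index):
        """Read or write one sub-block; returns True on a hit, False on a miss."""
        hit = self.valid[index]
        # A miss loads the sub-block from off-chip. Either way the cell is rewritten,
        # which counts as a refresh, so the status bit is reset.
        self.valid[index] = True
        self.needs_refresh[index] = False
        return hit

    def end_interval_invalidate_all(self):
        # First scheme: reset every VALID bit in parallel.
        self.valid = [False] * len(self.valid)

    def end_interval_selective(self):
        # Second scheme: drop only sub-blocks that are valid but were not accessed.
        for i, stale in enumerate(self.needs_refresh):
            if self.valid[i] and stale:
                self.valid[i] = False


cache = DramCacheModel(4)
for i in (0, 1, 2):
    cache.access(i)          # three cold misses fill sub-blocks 0..2
cache.begin_interval()
cache.access(1)              # only sub-block 1 is touched during the interval
cache.end_interval_selective()
print(cache.valid)           # [False, True, False, False]
```

In this run only sub-block 1 survives the end of the interval; calling
`end_interval_invalidate_all()` instead would have cleared it as well, which is the extra cost
of the first scheme.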

Instruction caches are usually read-only (written only when a miss replaces the missed
block or sub-block), so either of the above methods can easily be employed if DRAMs are
used to implement the cache. To extend these methods to the data cache, the cache must
adopt a write-through policy for storing new values into its entries. Since all or any subset
of cache entries may be invalidated at any time under these schemes, any newly written
cache entry must also be stored in a safe place (main memory or an external cache). With
the write-through policy, subsequent accesses to invalidated cache entries will miss and
retrieve the correct data from that safe place. For a multi-level cache design, a
write-through policy for the on-chip data cache (the highest level in the hierarchy) is a
reasonable choice, since it is the simplest method of assuring data consistency among the
caches in the hierarchy. A better write policy can still be applied to the external cache.

3.3.4.  Evaluations 

Two methods of using DRAMs without actual refreshing of memory cells are
evaluated in this section. I use the instruction cache of the SPUR CPU (the SPUR IB) as an
example for the evaluation. Although it is optimized for the given constraints of the SPUR
CPU architecture, the SPUR IB is a good representative model among the different instruction
cache organizations for single-chip microprocessors, and hence it is chosen for this
evaluation. The SPUR IB was implemented using 6T CMOS SRAM cells. I will compare the
performance of the SPUR IB to that of the SPUR IB implemented as a DRAM cache with each of
the above two methods, which eliminate the refreshing overhead.

Figure 3-4. Circuits implementing the selective invalidation

I use the miss ratio as the performance measure for different caches. Trace-driven cache
simulations that directly compute the miss ratios of caches with different parameters are
employed here. Other performance measures, such as effective access time [Hil87b], can
easily be computed from the miss ratios determined in this evaluation.

Figure 3-5. SPUR instruction cache organization

The same traces used in the SPUR IB design [Hil87b] are also used here. Those traces are:

(1) Weaver, a production system written on top of OPS5 for VLSI chip routing [Joo85];

(2) Rsim, a switch-level simulator simulating a counter [Ter83];

(3) Slc, the SPUR Lisp compiler [ZHH87], based on SPICE Lisp [THL86], compiling
part of itself.

For each of these programs, two 500K-instruction dynamic trace sets showing different
behaviors (medium and pessimistic) were collected, for a total of six traces or three million
instructions [Hil87b]. Since the miss ratio variation across the trace samples is small,
subsequent results are based on miss ratios for a composite trace formed by concatenating the
six traces. The traces are all the same length, so the miss ratio for the composite trace is
equal to the arithmetic average of the miss ratios of the individual traces.

The SPUR IB is an on-chip instruction cache organized as 16 blocks with eight
sub-blocks in each block, as shown in Figure 3-5. The size of a sub-block is 4 bytes, or one
32-bit instruction, which is the off-chip data transfer size of the CPU chip. Associated with
each sub-block, or instruction word, is a valid bit, so that any subset of the instructions
within a block may be valid. The SPUR IB uses this flexibility to reduce demand miss time by
loading only the fetched instruction rather than the entire block, and to permit instruction
prefetching to load the rest of a block in parallel with subsequent instruction fetches.

The architectural parameter that has the greatest impact on the SPUR IB miss ratio is
cache size. In the evaluation of DRAM caches, I keep all cache parameters of the SPUR IB
unchanged except the cache size, because it can vary with the DRAM implementation. Other
parameters, such as block (or sub-block) size, off-chip bandwidth (line size), and prefetch
algorithm, will affect the cache performance, but their effects are the same for both SRAM
and DRAM implementations. Table 3-1 shows the demand miss ratios of the SPUR IB for the six
traces, and Figure 3-6 plots the average miss ratio for different sizes of cache implemented
using SRAMs.

16384  2.42  3.01  0.36  0.81  1.98  0.53  3.18  4.52

Table 3-1. Demand miss ratios (%) of SPUR IB for 6 traces

Figure 3-6. Demand miss ratio of SPUR IB with different sizes of cache

The evaluation of the first scheme, invalidating the cache at the end of every refresh
interval, focuses on the effect of invalidation. The cache simulator, Dinero [Hil85], was
slightly modified to simulate a DRAM cache. A real cycle counter, which counts not only
references but also miss time and cycles lost to other causes, was used to invalidate the
cache at accurate intervals. The maximum degradation in performance (increase in miss
ratio) due to the invalidation of the DRAM cache can be determined by:

$$DMR = \frac{N_{sb}}{N_{ref}}$$

where $DMR$ is the maximum miss ratio increase (degradation due to invalidations), $N_{sb}$
is the total number of sub-blocks in the cache, and $N_{ref}$ is the average reference count
per refresh interval. The average reference count can be estimated by:

$$N_{ref} = \frac{Q}{1 + M \times miss\_ratio + C}$$

where Q is the refresh interval in number of cycles (e.g., 20,000 cycles), M is the miss
time, and C is the cycles per instruction lost for other reasons (e.g., external cache
misses). With a cycle time of 100 nsec (the SPUR CPU's), the refresh interval is set to
20,000 cycles, or 2 milliseconds, for all simulations. Figure 3-7 plots the maximum possible
increase in miss ratio for different cache sizes. Although the upper bound set by the above
equations is enormous for large caches, the actual difference in miss ratio between DRAM and
SRAM caches is much smaller. In fact, many cache entries may never be used again and will
eventually be invalidated. Invalidating those cache entries may not degrade the cache
performance, and hence the real difference in miss ratio is far less than the maximum
predicted in Figure 3-7. The simulation runs on the six traces mentioned above confirm this.
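The bound discussed above is easy to evaluate numerically. The sketch below assumes the
worst-case reading that every sub-block is lost once per interval, so that the bound is the
number of sub-blocks divided by the references per interval. The function names, the
sample values for M, the miss ratio, and C, and the reading of cache sizes as bytes are
illustrative assumptions, not figures from the SPUR simulations.

```python
def refs_per_interval(q_cycles, miss_time, miss_ratio, other_cpi):
    """Average number of references issued in one refresh interval of q_cycles."""
    # Each reference costs 1 cycle plus its expected miss penalty plus other lost cycles.
    return q_cycles / (1 + miss_time * miss_ratio + other_cpi)


def max_miss_ratio_increase(num_sub_blocks, n_ref):
    """Worst case: every sub-block must be reloaded once per interval."""
    return num_sub_blocks / n_ref


# 20,000-cycle interval (2 ms at 100 nsec per cycle); the other values are assumed.
n_ref = refs_per_interval(20_000, miss_time=10, miss_ratio=0.1, other_cpi=0.0)
for size_bytes in (512, 4096, 32768):
    sub_blocks = size_bytes // 4          # 4-byte sub-blocks, as in the SPUR IB
    print(size_bytes, round(max_miss_ratio_increase(sub_blocks, n_ref), 4))
```

With these assumed values there are 10,000 references per interval, and the bound grows from
about 1.3% for a 512-byte cache to about 82% for a 32768-byte cache, which illustrates why
the bound is of little practical use for large caches.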

Figure 3-8 compares the miss ratio of a DRAM cache using the invalidation scheme to
that of a conventional SRAM cache as a function of cache size. The difference in miss ratios
is negligible for small caches, but becomes substantial as cache size increases. More
importantly, above a certain cache size the performance improvement from increasing the
cache size diminishes (as indicated by an arrow in Figure 3-8), due to the frequent
invalidations.

The selective invalidation scheme improves cache performance by not invalidating
the entire cache, instead invalidating only those entries that are not fresh and still valid.
The cache simulator was also modified to incorporate selective invalidation. First, a refresh
bit is added to each sub-block (or each access unit) structure, and a new operation (selective
invalidation) is added. When a timer (cycle counter) interrupt indicates refresh time, the
refresh and valid bits of each entry (or sub-block) are examined and the entry is invalidated
if necessary. The refresh interval must be half of the required interval, since some cells
may hold valid data from the beginning of one interval through the end of the next.

Figure 3-8. Effect of periodic invalidations on miss ratio

The results of the cache simulation run on the composite trace are shown in Table 3-2. The
miss ratio versus size is plotted in Figure 3-9 and compared to the SRAM cache performance.
With selective invalidation, the difference in performance is greatly reduced even for large
caches. The remaining miss ratio difference in large caches indicates that there are some
cache entries with a very long lifetime that are not accessed often, or some entries that are
active at intervals greater than the refresh period. This may depend on the referencing
behavior of the program or data.

Cache Size   No invalidation   Periodic invalidation   Selective invalidation
             (SRAM cache)      (DRAM cache)            (DRAM cache)
   ...           22.78              22.88                    ...
  2048           13.15              13.69                   13.15
  4096            9.75              10.82                    9.77
  8192            4.89               6.95                    4.97
 16384            3.01               6.02                    3.25
 32768            1.90               5.53                    2.26

Table 3-2. Miss ratio (%) of DRAM cache (SPUR IB, sub-block = 4 bytes)

Figure 3-9. Effect of selective invalidations on miss ratio


The second most influential cache parameter on the miss ratio of the IB, after cache size,
is the size of the sub-block. Figure 3-10 shows a plot similar to Figure 3-9 for the SPUR IB
with twice the sub-block size. Such an improvement made to the SRAM cache will work equally
well on the DRAM cache.

Figure 3-10. Effect of selective invalidations for different size of sub-block
(vertical axis: miss ratio; curves: no invalidation (SRAM cache), periodic invalidation
(DRAM cache), selective invalidation (DRAM cache))


3.4.  Multi-port  Memory  Design 
3.4.1.  Multiple  functional  units 

To improve the performance of a microprocessor, designers often look to approaches
that permit parallelism, or overlap, in the instruction execution stream. Traditionally,
pipelining has been one of the most popular of these approaches. Another technique, which can
be used independently or to complement pipelining, is the use of multiple functional units.
In either case, the application of such approaches can lead to a substantial improvement in a
processor's maximum performance, because the total computational resources simultaneously
available to a running program are increased. However, a common resource such as memory (or
on-chip local memory in the case of a single-chip processor) can become a performance
bottleneck unless enough bandwidth is provided between the memory elements and the several
functional units.

It is well known that the size of the local memory must be large if the computational
bandwidth of the processing elements is large, as represented by "Amdahl's rule"
[SBN82]. Furthermore, a well-designed microprocessor must provide "balanced" or
"matched" bandwidth between local memory and functional units. This matching of
bandwidth depends on the instruction set architecture (especially the instruction format) as
well as the speed of the circuits [Kuc78]. Together these two factors determine the bandwidth
required from the memory hierarchy. Given that a single-chip microprocessor with one
functional unit has a balanced bandwidth, if the number of functional units is increased by a
factor of α, the local memory bandwidth must also be increased by the same factor α
(without any other optimization) to rebalance the processing capacity of the multiple
functional units.

3.4.2.  Multiple  sets  of  register  files  and  multi-port  cache  memory 

There are several ways to increase the bandwidth between the local memory and multiple
functional units. Two prominent approaches are: (1) use multiple memories, such that each of
the multiple functional units can access at least one of them simultaneously; and (2) use a
multi-port memory, such that multiple functional units can simultaneously access the common
local memory. As previously mentioned in Section 2, many different on-chip memory
organizations are possible with either of these two approaches. However, some memory
organizations are particularly well suited to one of these approaches, while others are not.
For instance, having multiple caches can create cache consistency problems, even among local
on-chip cache memories.

Multiple sets (or banks) of register files (Figure 3-11a) would be a better choice for the
first approach. This is because the use of registers can be controlled to some extent by the
programmer (or compiler); hence data consistency among register sets (banks) is not required,
or at least can be maintained by software. With the multi-port memory approach (Figure
3-11b), using local memory either as a cache memory or as registers would be acceptable,
although a cache memory in one form or another would be a better choice, since it does not
require optimizations from a programmer or a compiler (and there is no need to maintain cache
consistency). The selection of cache memory type or register organization depends strongly on
architectural constraints as well as area and speed requirements.

To make the best use of the local memory under the above two approaches, careful
performance tradeoffs among different memory organizations should be made. The performance
tradeoffs can span from compiler design (optimal register allocation) to the actual
implementation of the memory for multiple functional units. Within this research, however,
only implementation tradeoffs, such as the area required and access time differences among
different memory designs, are considered.

Several different memory cells, and analysis techniques to evaluate them, have been
proposed for multi-port memories. The next section first reviews some of those multi-port
memory cells, and then proposes yet another possible circuit design technique for a
multi-port memory cell. Two local memory organizations, multiple sets of register files and a
multi-port cache implemented using the proposed memory cell, are chosen to represent the
above two approaches, so that a direct comparison (multiple set versus multi-port) between
the two approaches can be made. The objective of this comparison is to determine the
feasibility of an N-port memory based on the proposed cell (where N can be greater than two),
relative to the multiple set approach.
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(a)  Multiple  register  files 


(b)  Multi-port  memory 

Figure  3-11.  On-chip  local  memory  organizations  for  multiple  functional  units 
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3.4.3.  Multi-port  memory  cells 

Cross-coupled inverters have long been used as static storage elements because these
regenerative circuits are stable, compact, and more reliable than any other static cell. A
conventional 6-transistor SRAM cell uses the cross-coupled inverters and two access
transistors. The access transistor connecting the bit line and the storage node is controlled
by the word (select) line. A single-ported memory cell is accessed differentially from both
bit lines for one read or one write per cycle. Several kinds of CMOS dual-ported memory cells
(read-read, read-write, or write-write per cycle) based on this cross-coupled inverter cell
have been used in microprocessor chips and other applications for many years. These are shown
in Figure 3-12.


(b) Single-ended access dual-port memory cell
(c) 9T pseudo-static dual-port memory cell
(d) Pseudo-static dual-port memory cell with clocking

Figure 3-12. Multi-port memory cells

A single-ended access cell such as that in Figure 3-12 (b) is more compact than the
differentially accessed cell of Figure 3-12 (a), but requires a boosted word line to reliably
write the storage node to a high state. Using precharged bit lines, each port of the
single-ended access cell can perform independent read operations even with small transistors.
A modified version of this single-ended cell (single-ended read accesses and a differential
write access per cycle) is used for the register files of both the SPUR CPU and the RISC II
designs [She84]. The pseudo-static cell design approach requires extra transistors to break
the regenerative feedback action, which in turn makes the single-ended write operation simple
and safe.

To extend a dual-ported memory cell to an n-port cell where n is greater than two, n or
2n (depending on the configuration) extra access devices and the associated bit lines and
word lines can simply be added in the same way the original access devices are connected.
However, the area increase due to this addition, the resulting slower access time, and the
reduced safety of operation (noise margin) complicate the design of a multi-port memory cell.
A number of methods have been proposed to characterize various aspects of the cross-coupled
memory cell, such as simulation-based analysis [O'C87] and static noise margin analysis of
read/write operations with both single-ended and differential accesses [SLL87][Nak88]. The
static noise margin (SNM) of a static memory cell is defined as the maximum value of static
noise (dc disturbances such as offsets and mismatches due to processing and operating
conditions) that can be tolerated by the memory cell (cross-coupled inverter flip-flop)
itself before it changes state accidentally.

Next, I propose yet another circuit for a compact and efficient multi-port memory cell,
based on the 6T CMOS single-ended access cell approach mentioned above. One major
drawback of the single-ended access cell is that it requires a boosted word line (above Vdd)
to perform a write operation safely. In CMOS design, NMOS bootstrap circuits can be built,
but they have some disadvantages that make them difficult to implement. First, junction
breakdown is more probable because of the higher operating voltage when boosted. This becomes
a more serious problem as minimum dimensions shrink with technological advances. Secondly,
the bootstrap capacitor may require a substantial amount of area, because the bootstrap
capacitance must be comparable to the capacitive loading of the word line.

To reduce the circuit design complexity associated with the bootstrap driver, the voltage
on the power supply line (Vdd) can instead be reduced to a lower level, such as 3 V.
The word line operating voltage for a read can be around 3 V while that for a write can be
5 V; hence there is no need for a bootstrap driver. Reducing the voltage on the supply line
may affect the static noise margin, or stability, of a cell. With careful layout and
proper ratioing of the internal transistor sizes, the static noise margin can be improved.

The proposed multi-port cell using this technique is shown in Figure 3-13. This cell can
provide 2n reads and writes (single-ended accesses for both reads and writes). For each
access, separate word (row) select and bit lines are provided. Any combination of read and
write accesses to the array (a total of 2n operations) can be performed in each cycle,
provided that writes at different ports are unambiguously resolved without corrupting the
cell data. The internal forwarding scheme used in the SPUR CPU (see Chapter 2, Section 3) can
be used to resolve read and write conflicts on a cell. As more ports are added, each register
must sink more current to keep the access time constant with the increasing capacitance on
the added bit lines. This, in turn, increases the cell area because all transistors must be
scaled up accordingly. The maximum number of ports that can be attached to this cell with an
acceptable access time and a reasonable cell area is about ten (n = 5).

Figure 3-13. The single-ended access multi-port cell

3.4.4.  Analysis  and  comparison 

Three aspects are important for multi-port memory design: the cell area, the access
time, and the stability of the cell. The cell area determines the size (density) of the local
memory and is often directly related to the access time of the memory array. The stability of
the memory cell determines the sensitivity of the memory to process tolerances and operating
conditions. Considerable research has been performed in the past to analyze the stability of
cross-coupled inverter cells [SLL87][Lis86][JcF85][Nak88]. Recently, an analytical
approach to modeling the stability of the flip-flop cell has been reported [SLL87]. The static
noise margin, as defined in the previous section, is used as the stability measure in that
approach. The analytical expression for the SNM has been further developed for various
multi-port memory cells in [Nak88]. The SNM calculated in this analysis uses the expression
from [Nak88] for the single-ended access memory cell.

Analysis of the proposed multi-port memory cell is done mainly by circuit simulation
(SPICE). The advantages of using simulation over relying only on static noise margin
analysis are: (1) timing information is available; (2) parasitics can be taken into account;
and (3) actual device design parameters are available to accurately estimate the area
required. Including parasitics, such as bit line capacitance, in memory design is very
important since parasitics have a dominant effect on the access time of the memory. In the
simulations performed here, all parasitics are adjusted according to the configuration.

The feasibility of a multi-port cache memory on a chip, as compared to multiple sets of
register files, depends on the memory cell used. If a multi-port memory cell is too large or
too slow in comparison to the cell used for a common register file (for example, dual-ported
read and single-ported write), it may not be advantageous to have a multi-port cache memory.
When the total area of a multi-port cache memory array is less than the total area occupied
by multiple register files having the same number of ports accessible from the functional
units, use of a multi-port cache memory is justified. However, most multi-port memory cells
result in a much larger area than a simple, compact memory cell with a single or dual port.
Using the single-ended access cell with reduced supply voltage, as proposed, for a multi-port
memory cell can be more area-efficient than other multi-port cells, while maintaining
reasonable access time and noise margin characteristics.

Table 3-3 shows several design parameters for a single-ended memory cell when used
for two or more ports to memory. The device parameters of the cell in each configuration are
designed for minimal area, fast access time, and ample noise margin. Access times are
drawn from circuit simulation (worst cases) with bit lines precharged to 3.0 V prior to a
read, and with bit line and word line capacitive loadings adjusted for the additional number
of ports (1.0 pF initially with 2 ports). All operations are done without using a sense
amplifier. Therefore, access time is directly related to the size of the pull-down transistor
in the cell. Area estimates are derived from the total gate area of the transistors in the
cell. As the number of ports increases, the cell area is dominated by the pull-down
transistors and access transistors. Static noise margin is calculated by plugging the
transistor parameters obtained in the simulation into the analytical expression mentioned
above [Nak88]. Static noise margin is a function of the operating voltages and the ratios of
the transistors within the cell.
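As a rough check on how such area estimates arise, the sketch below sums gate areas
(width times length) over the transistors of a single-ended cell. It assumes that the three
W/L columns of Table 3-3 give the pull-up, pull-down, and access transistor sizes in microns,
that a cell has two pull-ups and two pull-downs, and that each port adds one access
transistor. None of this is stated explicitly in the text, and the function name `gate_area`
is hypothetical.

```python
def gate_area(pull_up, pull_down, access, ports):
    """Total gate area of one cell; each size is a (width, length) pair."""
    def area(size):
        width, length = size
        return width * length

    # Two cross-coupled inverters (2 pull-ups + 2 pull-downs), one access device per port.
    return 2 * area(pull_up) + 2 * area(pull_down) + ports * area(access)


# (pull-up, pull-down, access) sizes as read from Table 3-3, keyed by port count.
cells = {
    2: ((3, 6), (8, 2), (4, 2)),
    6: ((4, 2), (22, 2), (8, 2)),
    8: ((8, 2), (30, 2), (11, 2)),
}
for ports, (p, d, a) in cells.items():
    print(ports, gate_area(p, d, a, ports))
```

Under these assumptions the 2-, 6-, and 8-port rows evaluate to 84, 200, and 328, which are
the area estimates listed for those rows, and the access transistors account for roughly half
of the total at six and eight ports.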

As the number of ports increases, the write delay also increases. This indicates that the
write operation becomes more difficult as more ports are attached. Conversely, the read
delay decreases as more ports are added. This is because the size of the pull-down
transistor must be increased with the number of ports to improve the stability (static noise
margin) of the cell. Reducing the supply voltage and the word select line voltage (for reads)
has little effect on the noise margin, which can easily be controlled by the transconductance
ratios (W/L) of the transistors. Therefore, with careful transistor sizing, a multi-port
memory with performance comparable to single or dual port memory can be built using the
single-ended access cell for both reads and writes with different operating voltage levels.

A. Register cell: single-ended read, differential write
   (Vdd = 5.0, Vread_select = 5.0, Vwrite_select = 5.0)
B. Multi-port cell: single-ended read, single-ended write
   (Vdd = 3.0, Vread_select = 3.0, Vwrite_select = 5.0)

Memory  # ports          βp     βd     βa     area estimates   read delay  write delay  static noise
cell                                          (µm², relative)  (nsec)      (nsec)       margin (mV)
A       2-port read or                        84 (1.0)         11.5        2.0          355
        1-port write
B       2-port           3/6    8/2    4/2    84 (1.0)         11.5        6.0          355
B       4-port           3/3    16/2   6/2    136 (1.6)        9.0         8.5          524
B       6-port           4/2    22/2   8/2    200 (2.4)        7.5         13.0         474
B       8-port           8/2    30/2   11/2   328 (3.9)        7.2         18.0         491
B       10-port          10/2   45/2   16/2   410 (4.9)        7.0         26.5         370

Table 3-3. Design parameters for multi-port memory cells

As mentioned in Section 2, the area used to hold a byte of data in a cache differs from
that used to hold a byte in a register. A cache requires tags and state bits so that it can
be managed dynamically by the hardware. To compare the multi-port memory and multiple sets
of register files fairly, this fact must be taken into account. However, since the purpose of
this section is to examine the effectiveness of the proposed memory cell, I have only
compared the areas of cells with different numbers of ports. The area (memory array only)
required by multiple sets of register files can be estimated by simply multiplying the area
of a single set by the total number of ports required divided by the number of ports in the
single set. As can be seen from the table, using the proposed single-ended cell we can
integrate the multi-port memory in a smaller area than required by the multiple set approach.
To calculate the total area required for both approaches exactly, the areas of other
peripheral units, such as decoders (approximately the same for both approaches), multiplexors
(for multiple sets of register files, to route the register contents to the proper functional
units), and tags (including multi-port tag memory), must also be determined.
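The comparison can be made concrete with the area estimates of Table 3-3. The sketch below
assumes that the 2-port register cell (area 84) is the single set being replicated and that
the multiple-set area scales with the ratio of ports as described above. The constant and
function names are hypothetical, and peripheral circuits, tags, and state bits are ignored.

```python
REGISTER_CELL_AREA = 84      # 2-port register cell (cell A in Table 3-3)
REGISTER_CELL_PORTS = 2

# Area estimates of the proposed multi-port cell (cell B in Table 3-3), keyed by port count.
MULTI_PORT_AREA = {4: 136, 6: 200, 8: 328, 10: 410}


def multiple_set_area(total_ports):
    """Cell area when register sets are replicated to reach total_ports."""
    return REGISTER_CELL_AREA * total_ports / REGISTER_CELL_PORTS


for ports, multi_area in MULTI_PORT_AREA.items():
    sets_area = multiple_set_area(ports)
    print(f"{ports:2d} ports: multiple sets {sets_area:.0f}, multi-port {multi_area}")
```

For every port count in the table the multi-port estimate is the smaller of the two (for
example 136 against 168 at four ports), although the margin narrows to a few percent at
eight and ten ports.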

3.5.  Summary 

An important factor in VLSI system design is the large difference in available
bandwidth between on-chip and off-chip communications. The communication bottleneck
caused by limited I/O pin bandwidth makes it desirable to pack as much functionality as
possible into the restricted area of a single chip. Small local memories can improve
performance by significantly reducing the off-chip bandwidth requirement. In a single-chip
microprocessor, silicon area is one of the scarcest resources, and designers must use it
efficiently under the given constraints to maximize performance. Therefore, the organization
of local memory must be effective, and memory density must be maximized for a given silicon
area.

In this chapter, two memory design techniques that can improve performance
without necessarily increasing the use of scarce silicon area are presented. Traditionally, due
to reliability concerns, only static memories have been used on a microprocessor chip.
Dynamic memories offer more bits per unit area than static memories, but fundamental
limitations, such as refreshing overhead, have prevented their use on a microprocessor chip.
The selective invalidation technique proposed here can eliminate the refreshing overhead of
dynamic memories when they are used as a cache memory (read-only or write-through cache).
This makes the replacement of static memory with high-density dynamic memory possible, and
results in better use of scarce silicon area. Trace-driven simulations show the effectiveness
of this scheme over a simple invalidation scheme.

When multiple functional units are used to increase performance through parallel
execution, the demand for higher bandwidth between the functional units and local memory
rises rapidly. Since a multi-port memory is prohibitively expensive, time-shared accesses to a
single-port memory have been used when multiple accesses are necessary. A single-ended
access memory cell operated at reduced voltage levels can be as safe and fast as differentially
accessed cells. When this cell is used to implement an n-port memory (n > 2), it can result in
a total memory array area smaller than that of multiple register files with the same number of
ports available to the functional units.

Control Design Alternatives


4.1.  Introduction 

In a microprocessor design, many of the modules can be designed using regular and
straightforward design styles (ROM, RAM, and bit-sliced data path). However, the control
unit is often the 10% of the chip area that takes 90% of the time to design. Alternatively, if a
fast but simplistic approach is used for its design, a very inefficient implementation will
result. This chapter considers the automated synthesis of digital logic, and especially the
synthesis of the types of random logic seen in the control unit of full-custom VLSI
microprocessors.

A common approach to regularizing the design of random control logic employs
structured logic elements, such as PLAs, to implement the microprocessor’s control.
Automatic PLA synthesis tools have been widely used for many years. Recent developments
in integrated circuit (IC) CAD offer VLSI designers a variety of implementation choices
which have not previously been available in full-custom VLSI design. In particular,
multi-level logic synthesis and optimization techniques [Seg87][Bra87] allow combinational
logic to be mapped into different design styles (in multi-level form) such as standard cell,
gate-matrix [LoL80], and gate-array designs, in addition to the conventional implementation
style based on PLAs. However, the relative merits of these alternatives for full-custom VLSI
microprocessor design have not been well established. This chapter focuses on the evaluation
of these alternatives for microprocessor control designs. The results should be useful as a
guide for future microprocessor development.

This chapter begins with a review of control design strategies. Section 2 presents
two general approaches to implementing control units in a microprocessor: microprogrammed
control and hard-wired logic implementation. Microprogrammed control design has been
popular because many computer-aided design tools help to implement it automatically.
Advances in CAD systems and recent developments in computer architecture, such as
reduced instruction set computers (RISC), suggest that a fast, hard-wired implementation of
control logic is now affordable and highly desirable for a high performance microprocessor.
Section 3 discusses the automated synthesis of control functions and presents alternative
implementations of the hard-wired control logic. Section 4 evaluates several prototypes
implemented using the alternative methodologies presented in Section 3. Several examples
from the SPUR design are used in this investigation. Section 5 summarizes the results
obtained from the study.

4.2.  Microprocessor  control 

Microprocessors generally consist of two parts: an execution unit and a control unit.
The execution unit contains the resources needed to execute the microprocessor’s instructions,
which include the general purpose registers, the arithmetic and logical unit (ALU), the
shifter, and the instruction counters. The control unit "runs" the execution unit, telling it
what to do and when.

Two general approaches to control unit design are reviewed and compared in this
section. The objective is to compare the synthesis systems of the two approaches for the
automatic generation of control logic. Microprogrammed control provides a flexible
implementation using fast on-chip memory to store control instructions (microcode), but often
requires several cycles to execute one instruction. Each instruction is implemented as several
microinstructions that must be fetched from the storage (ROM) and decoded in each cycle. A
hard-wired implementation can perform better than a microprogrammed control because each
instruction is directly interpreted in hardware and can be executed in a single CPU cycle.
However, the hard-wired design approach has been prohibitively expensive and inefficient,
especially for a large and complex instruction set [Anc83]. The problems are largely due to
the increased complexity as an instruction set becomes richer and more features are required
to implement it. This trend, however, is changing because of newly developed CAD tools for
hard-wired logic synthesis.

Advanced CAD systems, such as those for multi-level logic synthesis and optimization,
along with automatic layout generation systems, have made it possible for VLSI designers to
reconsider the hard-wired implementation. In a microprogrammed implementation, the
control functions are described in a special high-level programming language, then compiled
down to microcode via various computer aids and computer-aided optimizations. More
recently, a similar approach has become available for hard-wired implementations (see
Figure 4-1). A designer states the required behavior of the control functions in a hardware
description language, such as ISP'. This description is then automatically synthesized into
lower levels in the design abstraction hierarchy (e.g. logic gates or layout). This automated
design process for hard-wired implementation is analogous (Figure 4-1) to the design process
used in the microprogrammed implementation and makes the hard-wired implementation as
efficient and flexible as microprogrammed control.

4.2.1. Microprogrammed control

The function of the control unit in a microprocessor is to execute sequences of
micro-operations for the successful completion of the processor’s instructions. The control
function that specifies a micro-operation is a binary variable. During any given time interval,
certain micro-operations are to be active while all others remain idle. Thus the
micro-operation steps within each time interval can be represented by a string of 1’s and 0’s
called a "control word" or "microinstruction." A control unit in which micro-operation
sequences are stored in a memory in this form is called a microprogrammed control. Each
microinstruction may contain as many bits as there are control points in the processor, to
control a variety of components operating in parallel (horizontal microinstructions). The
number of control bits in a microinstruction word can be reduced by grouping mutually
exclusive variables into fields and encoding the k bits in each field to provide 2^k
micro-operations (vertical microinstructions). Each field then requires a hardware decoder to
produce the corresponding control signals.
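
The difference between horizontal and vertical microinstructions can be illustrated with a
short Python sketch. The micro-operation names and the grouping below are hypothetical, and
the sketch assumes that code 0 in every field is reserved to mean "no micro-operation
active", so a field of n operations needs enough bits for n + 1 codes.

```python
def field_width(n_ops):
    """Bits needed for n_ops mutually exclusive micro-operations plus code 0 (idle)."""
    return n_ops.bit_length()


def encode_vertical(groups, active):
    """Return one encoded field value per group for the set of active micro-operations."""
    word = []
    for ops in groups:
        chosen = [op for op in ops if op in active]
        if len(chosen) > 1:
            raise ValueError("micro-operations within a field must be mutually exclusive")
        word.append(ops.index(chosen[0]) + 1 if chosen else 0)
    return word


def decode_vertical(groups, word):
    """Model the per-field decoders: expand field codes into one bit per control point."""
    return {op: int(code == i + 1)
            for ops, code in zip(groups, word)
            for i, op in enumerate(ops)}


# Hypothetical control points, grouped so that each group is mutually exclusive.
GROUPS = [["alu_add", "alu_sub", "alu_and"], ["shift_left", "shift_right"], ["reg_write"]]

horizontal_bits = sum(len(g) for g in GROUPS)           # one bit per control point
vertical_bits = sum(field_width(len(g)) for g in GROUPS)
print(horizontal_bits, vertical_bits)                   # 6 5

word = encode_vertical(GROUPS, {"alu_sub", "reg_write"})
print(word)                                             # [2, 0, 1]
print(decode_vertical(GROUPS, word)["alu_sub"])         # 1
```

The saving grows with the size of each mutually exclusive group, at the cost of the decoder
delay on every field.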

The complexity of the microprocessor control is due to the many different
micro-operations performed in a given time sequence. Microprogrammed control is an elegant
and systematic method for generating the micro-operation sequences, especially for a large
and complex instruction set. In practice, the use of microprogrammed control has been tied to
the architecture or instruction set to be implemented [Hop83]. It is generally easier to
implement high-level complex instruction sets in microcode, although this may result in a
slower implementation than the hard-wired approach. Simple or reduced instruction sets do
not normally require a microprogrammed control. Instead, a fast and simple hard-wired
control is used to implement single-cycle execution of all instructions.

Microprogramming provides several advantages, such as permitting a structured approach
to control unit design, which greatly improves debugging and tailorability. Moreover, it can
be extended easily to include additional instructions beyond the original instruction set, or
can emulate other instruction sets. It does so without modifying the existing hardware, other
than the control unit. With the continuing growth of semiconductor processing technology
(especially ROM and RAM designs on a microprocessor chip), microprogrammed control
can be a cost-effective implementation for microprocessors with richer and more complicated
instruction sets [Ber81]. Many microprocessors exemplify microprogrammed control,
such as the Motorola MC680x0 [MMM84], the Intel 80386 [Gel87], and the National
NS32532.

Although it is an elegant and flexible approach to designing a complicated control unit,
writing the microprogram has remained a very difficult task. The problem is harder when
there are many more potential microinstructions than there are regular processor instructions.
Many computer aids, such as compilers and debuggers for writing the microprogram, have
been developed. These help make microprogrammed control design efficient and error-free.
Microprogramming is still one of the efficient ways to design microprocessor control, but
the new alternatives available from automated hard-wired control synthesis must be carefully
evaluated and compared to microprogramming.

4.2.2. Hard-wired control

The term "hard-wired control" refers to an implementation technique for a microprocessor
control unit in which conventional logic gates, such as NAND or NOR, steer the master
clock phases to the control points. Each control point is driven by a gate with inputs that
determine the conditions under which that control point is to be activated. Thus the process of
control unit design consists of listing the control points to be activated as a function of the
master clock phases and the decoded instruction opcode. Logic optimization techniques
can then be used to reduce the number of gates involved.
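
As a minimal sketch of this style, each control point below is written as a Boolean function
of the clock phase and the decoded opcode. The opcodes, phase numbers, and control point
names are hypothetical and are not taken from the SPUR design.

```python
# Each entry plays the role of the gate driving one control point:
# it is asserted only for the listed opcodes during the listed clock phase.
CONTROL_EQUATIONS = {
    "alu_latch_inputs": lambda op, phase: phase == 1 and op in {"ADD", "SUB"},
    "mem_read":         lambda op, phase: phase == 1 and op == "LOAD",
    "reg_write_enable": lambda op, phase: phase == 2 and op in {"ADD", "SUB", "LOAD"},
}


def control_points(opcode, phase):
    """Evaluate every control point for one opcode during one clock phase."""
    return {name: int(eq(opcode, phase)) for name, eq in CONTROL_EQUATIONS.items()}


print(control_points("LOAD", 1))
# {'alu_latch_inputs': 0, 'mem_read': 1, 'reg_write_enable': 0}
print(control_points("ADD", 2))
# {'alu_latch_inputs': 0, 'mem_read': 0, 'reg_write_enable': 1}
```

A logic optimizer would then share common terms among these equations (for example, the
opcode decoding) to reduce the gate count.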

The control unit of a simple instruction set microprocessor can best be implemented in
hard-wired logic. The hard-wired design will be faster than a microprogrammed design
built from the same technology, since the former does not require the overhead of fetching
and decoding microinstructions (micro-sequencing also complicates the design). Recent
VLSI RISC microprocessors, such as SPARC [NaA88] and MIPS R3000 [Mou], employ
hard-wired control design to achieve the fastest possible cycle time.

Figure 4-1. Design processes of microprocessor control for
(a) microprogrammed control and (b) hard-wired control.

A version of hard-wired control well-suited to VLSI design has been the programmed
logic array (PLA). A two-level representation of logic functions can be efficiently and
automatically implemented with a PLA. Several optimization techniques, such as logic
minimization [Bra87] and topological optimization [DeS83], further improve the quality of
the design. However, certain multi-level logic functions do not map well into PLAs. In this
case a multi-level representation may lead to a better implementation, with a reduced gate
count and a smaller area than the two-level PLA implementation. A comparable or shorter
delay path is also possible via an optimum allocation of gates (technology mapping). In fact,
a two-level logic representation can be seen as a special case of multi-level representations.
Therefore, a general synthesis system for control logic design should offer multi-level
synthesis tools which are able to select a two-level implementation whenever it is more
effective in terms of area and/or speed. The multi-level logic can be mapped into many
different design styles, such as the following: standard cells; gate matrix; Weinberger array
[Wei67]; and gate arrays. Several CAD systems are being built for the design of random
control logic, especially in multi-level representation. In the next section, new controller
design strategies using these systems are presented and closely examined.

4.3.  Alternative  implementations 

A large spectrum of design styles for the VLSI microprocessor has evolved, offering
wide ranges of expected turn-around time, improved performance and reduced design effort.
Most microprocessors use PLAs to implement the combinational part of the control unit in a
two-level logic representation. Gate-matrix and Weinberger arrays (usually for NMOS
design) are array-structured logic for standardized layout of multi-stage combinational logic
networks. Semi-custom design styles, such as gate array and standard cell based designs,
aim at minimal design effort and faster turn-around time. The performance of semi-custom
designs has not matched that of their forerunners, but has the potential to be greatly improved
through recently developed multi-level logic optimization techniques and the automated
layout generation systems associated with semi-custom designs.

In this section, I first discuss a generalized CAD methodology for a hard-wired
implementation of a microprocessor control unit, then present strategies for three different
implementation styles which can actually be used in implementing real-world examples.
Three styles, the PLA-based, the standard cell-based, and the gate-matrix based, are chosen
for the following reasons: (1) these can be easily adapted and mixed with full-custom design
styles; (2) designs generated using these styles are more silicon efficient and perform better
than others; and (3) reliable CAD tools are readily available for these styles. The following
section evaluates, in various aspects, different implementations of these styles as applied to
examples from the SPUR design.

4.3.1. Automatic synthesis of control logic

A number of strategies are employed to deal with complexity in VLSI design. One of
the most frequently used is to divide the design into parts such that each can be implemented
using the most appropriate strategy (Figure 4-2). In general, designing the control part of the
microprocessor is quite different from designing other parts such as the data path or local
memory. Due to its complexity, control unit design requires many iterations, thus portions of
its design process need to be automated. Furthermore, building complex control logic,
especially in a hard-wired implementation, requires optimizations at each step of the design
process, e.g. minimizing the area required while reducing the delay. A generalized design
process for the control logic involves three steps [NeS86]: behavioral synthesis, logic
synthesis and optimization, and layout generation (Figure 4-3).

Behavioral synthesis is a translation from a behavioral description of the control
hardware to a detailed functional description, such as a register transfer level description.
The difficulty in this step stems from the many constraints, design objectives, and design
configurations to consider. Logic synthesis generates a logic network from a functional
description of combinational logic. One of the primary difficulties in this step is in discerning
which sections of the description imply pure combinational logic, and which parts are
intended to be sequential logic.

The logic synthesis step takes a functional description as an input and creates
appropriate logic equations to implement the described logic. The output of the logic
synthesis is minimized and mapped into logic structures or gates in the logic optimization
process. The process of optimizing combinational logic is divided into two parts: a
technology independent part in which logic optimization is performed, and a technology
mapping phase in which the selection of the gates to implement the function is made. The
only implementation style automatically synthesized from a high level description has been
the PLA implementation with a traditional two-level logic optimization.




Figure 4-2. Separation of design methodologies

Figure 4-3. Design synthesis path of the microprocessor control


Implementing logic in multiple-level form has several advantages. By manipulating
multiple-level logic, one can optimize the logic for minimum delay or minimum area
[Bra87]. As the complexity of the logic increases, a PLA implementation suffers from
declining performance and an increasing area requirement. This is not necessarily the case for
a multiple-level implementation, where tradeoffs between speed and area can easily be made.
Multiple-level logic can be implemented using a broad spectrum of design styles, such as
standard cell, gate array, or other array-structured logic elements such as gate matrix and
Weinberger array. Automatic layout generation tools for such technologies are mature enough
to generate a high quality layout when optimization of the logic is properly done.


4.3.2.  PLA  based  control  design  -  A  traditional  approach 

PLAs are two-dimensional array logic implementing a canonical sum-of-products
two-level combinational logic function. The PLA consists of two planes, the AND plane and
the OR plane. The AND plane maps the primary inputs to the product terms, while the OR
plane maps the product terms into the outputs. In practice both of these planes are
implemented as NOR structures with an arbitrary number of inputs.
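
The two-plane structure can be modeled in a few lines of Python. This is a behavioral sketch
only: it evaluates the logical AND and OR planes rather than the NOR-NOR circuit, and the
cube notation ('0', '1', '-' for don't care) and the full-adder example are assumptions
chosen for illustration.

```python
def eval_pla(and_plane, or_plane, inputs):
    """Evaluate a two-level PLA.

    and_plane: one cube per product term, a string of '0', '1', or '-' per input.
    or_plane:  one list of product-term indices per output.
    inputs:    sequence of 0/1 values, one per primary input.
    """
    # AND plane: a product term is true when every non-don't-care literal matches.
    products = [all(c == "-" or int(c) == x for c, x in zip(cube, inputs))
                for cube in and_plane]
    # OR plane: an output is true when any of its product terms is true.
    return [int(any(products[t] for t in terms)) for terms in or_plane]


# Full adder with inputs (a, b, cin) and outputs [sum, carry].
AND_PLANE = ["11-", "1-1", "-11",          # carry terms
             "100", "010", "001", "111"]   # sum terms (3-input XOR)
OR_PLANE = [[3, 4, 5, 6],                  # sum
            [0, 1, 2]]                     # carry

print(eval_pla(AND_PLANE, OR_PLANE, (1, 1, 0)))  # [0, 1]
print(eval_pla(AND_PLANE, OR_PLANE, (1, 1, 1)))  # [1, 1]
```

The XOR-based sum output already needs four product terms for three inputs, which hints at
why some functions are better served by a multi-level representation.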

PLAs  are  among  the  most  popular  structures  for  the  implementation  of  two-level  logic 
functions.  Most  of  the  recent  microprocessors  include  PLAs  in  the  control  part.  Because  of 
their  regular  structures,  PLAs  can  be  laid  out  automatically.  Many  PLA  layout  generators 
have  been  built  based  on  simple  mapping  of  the  Boolean  equations  into  the  layout  of  the 
PLA.  To  obtain  an  effective  design,  several  optimization  techniques  are  necessary.  They 
include  the  following:  logic  minimization;  topological  optimization;  and  layout  and  circuit 
optimizations. 

Logic minimization both reduces the area occupied by the PLA and improves its
electrical performance by minimizing the number of product terms required. Once the logic
minimization is completed, topological optimization can be performed to minimize the core
array of the PLA. The topological optimization itself does not contribute directly to the
implementation of the logic functions. The objective of the topological optimization is to
"fold" rows and/or columns of the PLA planes such that multiple logical rows or columns can
share a physical row or column [DeS83]. This reduces the total number of rows and columns
required, hence minimizing the area of the PLA. Layout and electrical optimizations [Hed85]
concentrate on the performance of a large PLA. The signal delay through a large PLA can
be minimized by transistor sizing, laying out interconnects in metal layers, or using a sense
amplifier between the planes and at the output.

A microprocessor control unit not only includes random combinational logic, but also
requires sequential logic elements, such as latches, in order to implement a control block with
finite state machines. PLA-based finite state machines have been used in the design of
several microprocessors. The finite state machine uses a PLA to implement the combinational
part of the logic, and the outputs (state bits) of the PLA are fed back to the inputs of the PLA
via clocked registers. Other logic blocks with both combinational and sequential parts can be
implemented in a similar manner. However, the separation of the combinational parts and
sequential parts of the control logic is yet to be automated. As described in Chapter 2, the
control unit of the SPUR CPU followed this methodology.
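
The feedback arrangement of a PLA-based finite state machine can be sketched as follows.
The machine shown, a 2-bit counter with an enable input, is a hypothetical example and not
one of the SPUR controllers; the PLA is modeled behaviorally with cubes over '0', '1', and
'-' (don't care).

```python
def pla_outputs(and_plane, or_plane, inputs):
    """Evaluate a two-level PLA given cubes (AND plane) and term lists (OR plane)."""
    products = [all(c == "-" or int(c) == x for c, x in zip(cube, inputs))
                for cube in and_plane]
    return tuple(int(any(products[t] for t in terms)) for terms in or_plane)


# PLA inputs are (enable, s1, s0); PLA outputs are the next state (n1, n0).
AND_PLANE = ["1-0", "0-1", "01-", "-10", "101"]
OR_PLANE = [[2, 3, 4],   # n1 = s1 XOR (enable AND s0)
            [0, 1]]      # n0 = s0 XOR enable


def run(enables, state=(0, 0)):
    """Clock the machine once per enable value and return the visited states."""
    trace = []
    for en in enables:
        # State register: the PLA outputs are latched and fed back as inputs.
        state = pla_outputs(AND_PLANE, OR_PLANE, (en, *state))
        trace.append(state)
    return trace


print(run([1, 1, 0, 1, 1]))
# [(0, 1), (1, 0), (1, 0), (1, 1), (0, 0)]
```

Only the `run` loop is sequential; everything inside `pla_outputs` is combinational, which is
the separation that has to be made by hand when preparing a control block for PLA synthesis.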

4.3.3.  Standard  cell-based  control  design 

The standard cell approach to VLSI chip design provides the designer with a quick,
flexible design. It may be less dense than a full-custom designed chip, but it performs far
better than gate array designs. A standard cell library includes simple cells such as
INVERTER, NAND, and NOR gates, as well as complex gates such as the AND-OR-INVERT
gate and various flip-flops. The cell library greatly simplifies the automated synthesis path by
isolating technology dependencies from the synthesis system. All cells in the standard cell
library have identical height and variable width, depending upon the complexity and size of
each cell. Each cell contains a completely interconnected function and can be abutted to other
cells without any adjustment. The automated synthesis system treats the standard cell as an
abstract object, like a bounding box with terminals, and thus placement and routing of cells
becomes technology independent.

The layout synthesis for standard cell-based design consists of three parts: (1) design of
the standard cell library, (2) logic minimization and technology mapping (selection of cells),
and (3) placement and routing of cells. The cell library is usually provided by the silicon
foundries, but can be custom designed for specific needs. The size of the library is important
in achieving an optimal implementation [Keu87]. Both logic minimization and optimal
technology mapping [Dci87] rely heavily on the different types of standard cells available in
the library. The impact of the cell library on the final design will be investigated in the
following section.

4.3.4.  Gate  matrix  based  control  design 

The gate matrix layout style [LoL80] utilizes the configuration of a matrix composed of
intersecting rows and columns to provide transistor placement and interconnections (Figure
4-4). The matrix format structure, which is orderly and regular, gives high device packing
density and allows ease of checking for layout errors. The columns of this matrix,
implemented in polysilicon, serve as transistor gates and interconnections. The rows are
implemented in diffusion and form transistors with a column at the intersection. The pitches
of the columns and rows are determined by the minimum separation allowed between
polysilicon lines with contacts and between transistors (diffusion) or interconnections (metal),
respectively.

The automatic synthesis path for the gate matrix layout style from a logical description
is very similar to that for PLAs. Logic minimization is done first, and the optimized logic
equations are mapped to the gate matrix. As with the folded PLA, topological optimization
[DeN87] can be performed to reduce the total number of rows and columns required.
However, unlike PLAs, gate matrix can be used to map multi-level logic representations and
to implement mixed combinational and sequential circuits. Latches or registers can be laid
out and mixed with combinational parts inside the gate matrix.

With both multi-level logic minimization and topological optimization, gate matrix
provides very high packing density for multi-level logic functions, but the resulting
performance may not be as good as with other implementations. This is because the size of
the transistors is fixed and several transistors in a series connection can be laid out
inefficiently with very high interconnect parasitics. To obtain an optimal implementation with
gate matrix layout, both an optimal partitioning of logic functions and electrical optimization
(or transistor sizing) are necessary.

Figure 4-4. Gate matrix layout (from [LoL80])


4.4.  Evaluation  and  comparison 

4.4.1.  Method  of  evaluation 

To evaluate alternative implementations rigorously, correct performance measures must
be used. The primary and more quantitative parameters are critical path timing, area, power
consumption, and design time. The secondary and more qualitative parameters are
flexibility and testability. Flexibility measures the ease of changing a design, which depends
on the overhead associated with regenerating the layout from the altered description.
Testability is hard to measure unless built-in test structures are incorporated. For the purpose
of evaluation, I use the following parameters: critical path timing; area required; estimated
power consumption; design time; and flexibility. Some parameters can be gathered directly
from the resulting implementation, while others are based on qualitative judgement.

The  examples  used  in  this  experiment  are  from  the  SPUR  design: 

(1)  SPUR Instruction Unit Controller (iu_ctr) controls the fetching and prefetching of the
instruction cache and handles misses.

(2)  SPUR CPU Master Control (master_ctr) controls the pipeline execution of CPU
instructions, provides both cache controller and floating point unit interfaces, and
handles traps and interrupts.

(3)  Cache Controller Sequencer (cc_seq) implements the processor cache control functions,
including access requests from the CPU, reads and writes on cache memories, and
translation of a virtual address to a physical address.

The SPUR designs were full-custom with PLA-based control implementations, thus
only the other styles needed to be re-implemented. The design processes for the different
styles are depicted in Figure 4-5. All designs begin at the same abstraction level, the
functional description written in BDS. A set of synthesis tools built around the OCT design
database [Har86] at UCB is used to create the layout. Logic synthesis and optimization steps
are identical for all implementations. The hardware description of the control function is
translated into logic equations and optimized using CAD tools called bdsyn and mis,
respectively. The optimized logic is then mapped to different layout styles within mis, and
the final layout is generated using the appropriate layout generation tools. For
array-structured logic, topological optimization is performed to further improve the design.

The layout generation tools used include Wolfe, a standard cell place and route system
which uses Timberwolfe-SC for placement and YACR for routing, GEM (gate-matrix layout),
and MPLA. The final layout is transformed into another database called magic [Ous85], to be
extracted and evaluated. Converting the database assures fair comparisons with the already
existing full-custom PLA-based implementations in magic format. Electrical performance is
measured by running timing analysis tools, crystal and spice, on the extracted layout. Power
consumption is estimated using the extracted capacitances of all switching nodes.

Figure 4-5. VLSI design environment with OCT
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4.4.2.  Results  of  evaluation  and  comparison 

Results from the evaluations are summarized in Tables 4-1, 4-2, and 4-3. Both the
SPUR IU control (iu_ctr) and the master control (master_ctr) of the SPUR CPU have four
different versions: one full-custom implementation with PLAs, two standard cell-based
designs with different cell libraries, and one gate-matrix implementation. The SPUR CC
sequencer (cc_seq) has five different versions; the extra version is implemented using an
electrically optimized PLA, in which sense amplifiers are used in the middle of the PLA
planes and at the PLA outputs. The sense amplifiers reduce the voltage swing on the product
term and output lines, where parasitic loading would otherwise slow signal propagation.
Two standard cell libraries were built for this experiment, and the cells in each library
are listed in Table 4-4. One library consists of only seven simple gates, while the other
also includes more complex gates such as XOR and AND-OR-INVERT gates.

A.  Critical  path  timing 

The most important performance parameter of the control design is the critical path
timing, as it often determines the cycle time of the microprocessor. To obtain accurate
timing, all circuits are analyzed from the extracted layout with all parasitics taken into
account. For the iu_ctr, the standard cell-based designs perform very close to the
full-custom design, while the gate-matrix version still satisfies the required timing.
Consistent results are also observed in the different master_ctr implementations. The
cc_seq, which consists of combinational logic only, shows different results. The PLA
implementation of cc_seq without electrical optimization (that is, without sense
amplifiers) performs much worse than the standard cell-based implementations. This
indicates that for a large combinational logic network, a multi-level implementation is
better than a two-level PLA implementation, in which parasitics have a dominant effect on
the critical path. The critical path delay of the gate-matrix version of cc_seq far exceeds
the timing requirement. For reasonable performance with gate-matrix design, proper
partitioning of the logic or electrical optimization is crucial. This can be seen in the
gate-matrix versions of the iu_ctr and master_ctr, where proper partitioning had already
been done.


                      IU_CTR1           IU_CTR2           IU_CTR3           IU_CTR4
Design style          Full custom       Standard cell     Standard cell     Gate matrix
                      with PLA tools
No. of cells designed 9                 17                7                 none
Total no. of gates    6 PLAs (95/30),   169 (13)          193               824 transistors
                      3 random logic
Area                  1300x1400 (1.00)  1260x870 (.602)   1220x880 (.590)   1250x1470 (.99)
Total capacitance     17.2 pF           31.7 pF           32.8 pF           13.3 pF
(switching)
Power estimation      30 mW             Less than 10 mW   Less than 10 mW   Less than 5 mW
Design time           4 wks             2 wks             2 wks             2 wks

Timing                phi1              phi2              phi3              phi4
(nsec)                12.10             9.85              7.90              11.50
(nsec)                13.35             11.35             9.55              12.40
(nsec)                19.25             10.35             11.65             16.50

Table 4-1. SPUR IU control


                      MC1               MC2               MC3               MC4
Design style          Full custom       Standard cell     Standard cell     Gate matrix
                      with PLA tools
No. of cells designed 14                17                7                 none
Total no. of gates    5 PLAs (133/67),  384 (38)          417               2310 transistors
                      9 random logic
Area                  1920x3070 (1.00)  2570x1530 (.667)  2380x1540 (.622)  2680x2410 (1.10)
Total capacitance     44.45 pF          82.90 pF          83.97 pF          45.32 pF
(switching)
Power estimation      50 mW             Less than 30 mW   Less than 30 mW   Less than 15 mW
Critical path timing  20.50             19.00             20.85             26.70
(nsec)
Design time           10 wks            4 wks             4 wks             4 wks

Table 4-2. SPUR master control


The effect of library size on the timing performance of the standard cell implementations
is very small compared to the difference in total gate counts. Using complex gates from the
large library may reduce the total gate count, but it does not necessarily reduce the
delay, since complex gates are slower than simple gates. In fact, each complex gate
replaces about two to three simple gates in all implementations, and complex gates are
about two to three times slower than simple gates. This indicates that a large set of
library cells does not necessarily optimize the performance. Therefore, without spending a
great deal of time to design and optimize a comprehensive cell library, one can obtain a
high performance implementation with a small number of library cells. It is also possible
to further improve the performance of the standard cell version by optimizing the cells for
a particular design. If the library is small, optimizing and maintaining the cell library
is greatly simplified, which in turn reduces overall design time. Logic minimization
criteria may also affect the timing performance. In this experiment, the same logic
minimization steps were used in all implementations.
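
The two-to-three conversion ratio can be cross-checked against the gate counts in Tables
4-1 through 4-3. The sketch below is a rough illustration rather than part of the original
evaluation. It assumes that the parenthesized figure in each 17-cell library entry, such as
169 (13), is the number of complex gates, which the tables do not state explicitly, and
that the simple gates are otherwise unchanged between the two mappings. The function and
variable names are hypothetical.

```python
# Gate counts taken from Tables 4-1, 4-2 and 4-3.
# ASSUMPTION: in an entry such as "169 (13)", 13 is the number of complex gates.
GATE_COUNTS = {
    # unit: (total gates with the 17-cell library,
    #        complex gates among them,
    #        total gates with the 7-cell library)
    "iu_ctr": (169, 13, 193),
    "master_ctr": (384, 38, 417),
    "cc_seq": (491, 38, 526),
}


def simple_gates_per_complex_gate(large_total, complex_gates, small_total):
    """Estimate how many simple gates one complex gate stands in for."""
    simple_in_large = large_total - complex_gates
    # Simple gates that appear only in the small-library version.
    replaced = small_total - simple_in_large
    return replaced / complex_gates


if __name__ == "__main__":
    for unit, counts in GATE_COUNTS.items():
        ratio = simple_gates_per_complex_gate(*counts)
        print(f"{unit}: {ratio:.2f} simple gates per complex gate")
```

Under these assumptions the three control units give ratios between roughly 1.9 and 2.8,
which is consistent with the "about two to three" figure quoted above.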


                      SEQ1A             SEQ1B             SEQ2              SEQ3              SEQ4
Design style          PLA with S/A      PLA               Standard cell     Standard cell     Gate matrix
No. of cells designed 1                 1                 17                7                 none
Total no. of gates    207 p-terms,      207 p-terms,      491 (38)          526               2526 transistors
                      36 outputs        36 outputs
Area                  2060x3610 (1.00)  1240x2130 (.355)  1730x3160 (.735)  1750x3180 (.748)  1940x5360 (1.40)
Total capacitance     254.37 pF         119.91 pF         98.82 pF          101.98 pF         26.64 pF
(switching)
Power estimation      60 mW             60 mW             Less than 20 mW   Less than 20 mW   Less than 5 mW
Critical path timing  18.00             46.00             24.50             26.70             90.10
(nsec)

Table 4-3. SPUR CC Sequencer

B. Area


In a single-chip microprocessor, chip area is one of the scarce resources.
Array-structured logic elements offer a very high transistor density for a given silicon
area. However, if multiple units of such a structure are interconnected, the layout might
not have optimum area efficiency. This can be seen in the area difference between the
PLA-based design and the standard cell-based design. Placement and routing CAD tools
produce better results when optimizing many small standard cells than when optimizing a few
large PLAs or gate-matrix arrays. Even with topological optimization, the gate-matrix and
PLA-based designs still require larger areas.

A folded PLA can have a smaller core array area without necessarily minimizing the area
occupied by the entire PLA layout, because the I/O buffers must now be attached to both
sides of each array. As previously mentioned, to obtain reasonable performance from a
gate-matrix design, the logic needs to be partitioned into multiple gate matrices. Without
partitioning, the area required by a large logic network may increase unreasonably, as
evidenced by the cc_seq. However, partitioning introduces a problem similar to that
encountered with multiple PLAs. Gate-matrix design with folded columns and rows minimizes
the matrix area itself, but the overhead associated with interconnecting multiple gate
matrices reduces the area efficiency.

As with timing performance, the effect of library size on area is negligible. The total
areas required for iu_ctr and master_ctr are actually smaller with the small library than
with the large one. The reasoning given for timing performance also applies here: a complex
gate can replace two or three simple gates, but it is also about two to three times larger
than a simple gate.
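
The normalized areas quoted in Tables 4-1 through 4-3 follow directly from the listed layout
dimensions. As a hedged illustration (the helper name is hypothetical, and because the
tables do not state the length unit only ratios are computed), the sketch below divides the
bounding-box area of each standard cell layout by that of the first column of its table,
which serves as the 1.00 reference.

```python
# Bounding-box dimensions from Tables 4-1, 4-2 and 4-3 (length unit not stated).
# "reference" is the first column of each table, which is normalized to 1.00.
LAYOUTS = {
    "iu_ctr": {
        "reference": (1300, 1400),
        "std_cell_17_cells": (1260, 870),
        "std_cell_7_cells": (1220, 880),
    },
    "master_ctr": {
        "reference": (1920, 3070),
        "std_cell_17_cells": (2570, 1530),
        "std_cell_7_cells": (2380, 1540),
    },
    "cc_seq": {
        "reference": (2060, 3610),
        "std_cell_17_cells": (1730, 3160),
        "std_cell_7_cells": (1750, 3180),
    },
}


def normalized_area(dims, reference):
    """Area of a width x height layout relative to a reference layout."""
    return (dims[0] * dims[1]) / (reference[0] * reference[1])


if __name__ == "__main__":
    for unit, versions in LAYOUTS.items():
        ref = versions["reference"]
        for name in ("std_cell_17_cells", "std_cell_7_cells"):
            print(f"{unit} {name}: {normalized_area(versions[name], ref):.3f}")
```

For iu_ctr and master_ctr the 7-cell version comes out slightly smaller than the 17-cell
version, while for cc_seq it is slightly larger; in all three cases the difference is
within a few percent of the reference area.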

C.  Power  consumption 

There are two components that establish the amount of power dissipated in CMOS
VLSI circuits. These are:

(1) Static power dissipation due to leakage current or pseudo-NMOS circuits such as static
PLA with pull-up transistor (PMOS) always on.

(2) Dynamic power dissipation due to switching transient current or charging and
discharging of load capacitances.

LIB1                  LIB2
INV(3)*               INV
2,3,4-input NOR       2,3,4-input NOR
2,3,4-input NAND      2,3,4-input NAND
2-input AND
2-input OR
2-input XOR
2-input XNOR
21-AOI
22-AOI
21-OAI
22-OAI
17 Gates              7 Gates

* 3 different strength inverters

Table 4-4. Standard cell library

In PLA-based design, static PLAs are the main source of power dissipation. A static PLA,
although it consumes more power than other configurations, is usually fast and easy to
design. With static PLAs, the designer can make straightforward tradeoffs between power
and circuit speed. Dynamic PLAs are also fast and dissipate only dynamic switching power,
but they require multiple clock phases and careful design to avoid timing hazards. All
PLA implementations in this experiment used static PLAs; hence the power consumption of
the PLA-based designs is much greater than that of the other designs.

Both standard cell-based design and gate-matrix design use CMOS gates. The power
dissipation of these circuits consists mostly of dynamic components. Any static dissipation
is due to the reverse-biased leakage current that flows across the junction between the
diffusion region and the substrate. Static power dissipation due to leakage current for a
small circuit operating at five volts is usually a few nanowatts.

Dynamic power consumption for fully complementary MOS circuits can be estimated
by the equation:

$$P_{dynamic} = C_{total} V^{2} f$$

where $C_{total}$ is the sum of all capacitances on switching nodes, $V$ is the operating
voltage, and $f$ is the switching frequency of the circuit. The capacitances shown in the
tables alongside the power estimates are the sums of the capacitances on all switching
nodes. Switching nodes can be identified easily from the extracted layout. The total
switching capacitance consists of the gate capacitances of transistors (the inputs to a
gate), the interconnect wire capacitance, and the junction capacitance on the output node
of the gate. Gate-matrix designs show less power dissipation than the others, but this
results from the transistor sizes in the gate-matrix design being somewhat fixed and from
the lack of electrical optimization in the process of generating the layout.
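
As a hedged illustration of the dynamic power estimate, the sketch below evaluates the
product of total switching capacitance, the square of the supply voltage, and the switching
frequency for the standard cell and gate-matrix versions of iu_ctr, using the capacitances
from Table 4-1. The 5 V supply matches the operating voltage mentioned above for leakage.
The 10 MHz switching frequency is purely an assumed value chosen for illustration, since
the text does not state the frequency behind the tabulated estimates. The function name is
hypothetical.

```python
def dynamic_power_watts(c_total_farads, v_volts, f_hertz):
    """Dynamic power of fully complementary CMOS: C_total * V^2 * f."""
    return c_total_farads * v_volts ** 2 * f_hertz


V_SUPPLY = 5.0    # volts, the operating voltage mentioned in the text
F_SWITCH = 10e6   # hertz; ASSUMED for illustration, not given in the text

# Total switching capacitance in pF, from Table 4-1.
IU_CTR_CAPACITANCE_PF = {
    "IU_CTR2 (standard cell)": 31.7,
    "IU_CTR3 (standard cell)": 32.8,
    "IU_CTR4 (gate matrix)": 13.3,
}

if __name__ == "__main__":
    for name, c_pf in IU_CTR_CAPACITANCE_PF.items():
        p_mw = dynamic_power_watts(c_pf * 1e-12, V_SUPPLY, F_SWITCH) * 1e3
        print(f"{name}: {p_mw:.2f} mW")
```

Because the estimate is linear in frequency, choosing a different value for the assumed
switching frequency scales every result by the same factor and leaves the ranking of the
design styles unchanged.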

D.  Design  time  and  flexibility 

Given that the process of designing a microprocessor on silicon is complicated, the role
of good VLSI design aids is to reduce the complexity and assure the designer of a working
chip in a reasonably short period of time. The time spent designing the control unit of a
microprocessor and the flexibility of the design are closely related. The complexity of the
control unit often requires several iterations of the same design steps. It is, therefore,
desirable to have most of the design steps automated. Automation, in turn, reduces the
overall design time and makes the design flexible enough to incorporate frequent changes
easily.

The three styles of implementation being evaluated have most of their design steps
automated. The improvement in design time using these styles is tremendous compared to the
full-custom approach, and it allows VLSI designers to use hard-wired control in
microprocessor design. A full design cycle of a control implementation using these styles,
starting from a behavioral specification, can take as little as a few hours. This permits
designers to revise the entire design as many times as necessary. Once the logic is
partitioned into combinational and sequential parts and is optimized, the rest of the
implementation is straightforward, relying on automatic layout synthesis tools. Both the
standard cell-based and gate-matrix-based designs were produced by fully automated module
generation tools, as well as global placement and routing tools. They are therefore much
more flexible and take much less time than the full-custom approach, which has only
automated PLA generation tools. The design times shown in the tables are based on estimates
from the actual SPUR design and on the time spent on the re-implementations.

A separate and additional effort is required to build the cell library for the standard
cell-based design. Building a comprehensive, fully characterized cell library can be a
time-consuming process. However, as noted above, with a small library multi-level logic
optimization tools can produce hardware as good as that produced with a large library. This
fact, along with the other performance measures, makes the standard cell-based design more
attractive than the other designs. It is also easier to incorporate the sequential part of
the design in this style than in the others. With forthcoming sequential logic synthesis
and optimization systems (or mixed combinational and sequential logic synthesis and
optimization), the standard cell-based design will become a prime choice.

4.5.  Summary 

The key complexity of microprocessor design stems from designing the processor's
control unit, which may take up a small portion of the chip area but can consume most of
the design time. For the past decade or so, microprogrammed control has been a popular
approach to designing this complex portion of the processor, due to its flexibility and
computer-automated design processes. Recently, various VLSI CAD tools have emerged to
facilitate hard-wired control design. Automatic synthesis and optimization techniques at
different abstraction levels have made several alternatives to full-custom VLSI design
available. Behavioral synthesis and multi-level logic optimization systems provide
particularly efficient and high performance hard-wired logic implementations even with
semi-custom layout styles, such as standard cell-based design.

In this chapter, I have examined alternatives in hard-wired control design by
re-implementing the control units from the SPUR chips using different design styles, and
contrasting them with the full-custom versions built with only PLA synthesis tools. I found:

(1) The hard-wired approach to microprocessor control design has been greatly
improved by advanced VLSI CAD tools, especially in design time and in the quality of the
design. With these design aids, the process of designing hard-wired control now shares
the efficiency and flexibility of microprogrammed control.

(2) With recent developments in multi-level logic synthesis and optimization techniques,
hard-wired logic can be mapped not only into a two-level PLA implementation, but also
into various multi-level logic implementation styles which can provide performance
comparable to or better than the traditional two-level PLA implementation.

(3) Among the many different implementation styles, standard cell-based design has prime
potential for use in microprocessor control. The CAD tools built around standard
cell-based design are also sound, and optimizations occur at all levels of abstraction,
such as logic design and placement and routing. Multi-level logic optimization for standard
cell-based design is effective, and even with a small library it can generate optimized
logic that performs as well as that generated with a large library. A small library
further reduces the overall design time.

By no means is the evaluation in this experiment complete. Different results are possible
for other designs. The benefit of using examples from the SPUR design is that working,
full-custom hardware exists against which comparisons can be drawn.

5. Conclusion


5.1.  Summary 

In Chapter 2, details of designing the SPUR CPU chip were described. The methodologies
and techniques used to maximize the performance of a full-custom VLSI microprocessor
provide an overview of microprocessor design strategies. The rest of the research presented
in this thesis developed from new ideas and better alternatives which have become
apparent since the development of the SPUR CPU chip.

In Chapter 3, two alternative memory design techniques were presented: dynamic
memory for an on-chip cache memory, and a compact high bandwidth memory with multiple
ports. Selective invalidation instead of refreshing, implemented using low overhead dynamic
CMOS circuits, can effectively eliminate the need for periodic refreshing of dynamic
memory. With this scheme, the size of an on-chip local memory can be substantially
increased within a given allocation of scarce silicon area. Trace-driven simulations show
the effectiveness of this scheme over a simple invalidation scheme.

When multiple functional units are used to increase performance through parallel
execution, the demand for higher bandwidth between the functional units and local memory
increases rapidly. Using multi-port memory to balance the bandwidth required by multiple
functional units has previously been very expensive due to the large cell area requirement.
When a single-ended access memory cell is operated at reduced voltage levels, it can result
in a fast, stable memory that requires relatively little area. Several multi-port
configurations were designed and analyzed to demonstrate the feasibility of multi-port
memories based on this cell.

In Chapter 4, alternative implementation styles for a microprocessor's control logic were
investigated. Recently, various VLSI CAD tools have emerged to facilitate hard-wired
control design. Automatic synthesis and optimization techniques at different abstraction
levels have made several alternatives to full-custom VLSI design available. Behavioral
synthesis and multi-level logic optimization systems provide particularly efficient and
high performance hard-wired logic implementations, even with semi-custom layout styles such
as standard cell-based design. I have examined alternatives in hard-wired control design by
re-implementing the control units from the SPUR chips using different design styles, and
contrasting them with the full-custom versions available from the SPUR designs. I found
that standard cell-based design can be the best implementation style among those
considered, across the various aspects of the resulting design.

5.2.  Future  Research 

As more chip area is devoted to on-chip memories, several different types of local
memories are being integrated in a single-chip microprocessor. It would be interesting to
see how much memory should be allocated for instructions versus data, for registers versus
cache, or for other purposes such as memory management functions (e.g. TLBs).

As to multi-port memory design, I have examined implementation issues only. However,
to investigate the overall performance tradeoffs, other issues must be carefully
considered. These include the instruction format for multiple operations, the use of
hardware or software to balance the bandwidth required by the multi-port memory and the
functional units, and the design of an optimizing compiler that extends the register
allocation scheme to multiple register files.

For automated synthesis of control logic, the impact of library size on standard
cell-based design needs to be explored further to determine an ideal library size, in
conjunction with optimal gate (technology) mapping. This requires making generalizations
about the cost of creating and maintaining a library as well as assumptions about the
application domain. Since routing area is an important component of standard cell layout,
the impact of library size on the routing region also needs to be investigated.

Array-structured logic such as gate matrix offers a dense layout but is often marred by
poor electrical performance. For further improvement, optimal partitioning of the logic and
electrical optimization such as transistor sizing are necessary.

5.3.  Conclusion 

Optimizing performance in a full-custom VLSI microprocessor involves several choices,
including the choice of the best design methodology and the best implementation styles.
There is a broad spectrum of implementation styles that have proven successful for the
construction of various modules in a microprocessor chip. In general, the implementation of
a microprocessor can be divided into three activities: data path design, control logic
design, and on-chip memory design. The last has become important as more and more chip area
is devoted to local memories in order to minimize off-chip communication traffic, which
uses the scarcest resource of the microprocessor chip, the I/O pins. The research presented
in this thesis focuses on implementation issues of the on-chip memory and control logic of
a full-custom VLSI microprocessor.

Since the on-chip memory is limited in size, an optimal implementation of local
memory becomes increasingly complex. Many different cache and register organizations have
been proposed for various optimizations. When dynamic memory is substituted for static
memory, the size of the on-chip memory can be increased without increasing the chip area.
A provision must be made so that the operation of the DRAM does not interfere with the
processor's normal execution and thereby hamper performance. Using simple circuit design
techniques and a small modification of the cache, the periodic refreshing requirement of
DRAM can be effectively eliminated. A multi-port memory facilitates parallel processing
using multiple functional units. As with the DRAM cache above, a simple circuit design
technique can lead to a compact yet fast and stable multi-port memory cell.

A portion of the research is devoted to investigating various layout styles. All new
design methods aim for simplicity and regularity. Full-custom design, which aims for high
performance but requires a long design time, can be adopted when area or timing
considerations are critical, such as in high frequency data path design. Using automated
synthesis of control logic with semi-custom design styles, particularly with a multi-level
representation, makes the design process efficient and easy, and the resulting design is
comparable to that produced by full-custom design. The standard cell-based design style,
when combined with multi-level logic optimization, can provide a design as good as the
full-custom version in a much shorter design time. It is also shown that even with a small
library, the resulting layout is better in both delay and area than the other semi-custom
styles.